Copyright © 2019 Andriy Burkov
All rights reserved. This book is distributed on the “read first, buy later” principle. The latter implies that anyone can obtain a copy of the book by any means available, read it and share it with anyone else. However, if you read and liked the book, or found it helpful or useful in any way, you have to buy it. For further information, please email author@themlbook.com.
“All models are wrong, but some are useful.”
— George Box
“If I had more time, I would have written a shorter letter.”
— Blaise Pascal
The last twenty years have witnessed an explosion in the availability of enormous quantities of data and, correspondingly, of interest in statistical and machine learning applications. The impact has been profound. Ten years ago, when I was able to attract a full class of MBA students to my new statistical learning elective, my colleagues were astonished because our department struggled to fill most electives. Today we offer a Master’s in Business Analytics, which is the largest specialized master’s program in the school and has application volume rivaling those of our MBA programs. Our course offerings have increased dramatically, yet our students still complain that the classes are all full. Our experience is not unique, with data science and machine learning programs springing up at an extraordinary rate as the demand for individuals trained in this area has blossomed.
This demand is driven by a simple, but undeniable, fact. Machine learning approaches have produced significant new insights in numerous settings such as the social sciences, business, biology and medicine, to name just a few. As a result, there is a tremendous demand for individuals with the requisite skill set. However, training students in these skills has been challenging because most of the early literature on these methods was aimed at academics and concentrated on statistical and theoretical properties of the fitting algorithms or resulting estimators. There was little support for researchers and practitioners who needed help in implementing a given method on real-world problems. These individuals needed to understand the range of methods that can be applied to each problem, along with their assumptions, strengths and weaknesses. But theoretical properties or detailed information on the fitting algorithms were far less important. Our goal when we wrote “An Introduction to Statistical Learning with R” (ISLR) was to provide a resource for this group. The enthusiasm with which it was received demonstrates the demand that exists within the community.
“The Hundred-Page Machine Learning Book” follows a similar paradigm. As with ISLR, it skips involved theoretical derivations in favor of providing the reader with key details on how to implement the various approaches. This is a compact “how to do data science” manual and I predict it will become a go-to resource for academics and practitioners alike. At 100 pages (or a little more), the book is short enough to read in a single sitting. Yet, despite its brevity, it covers all the major machine learning approaches, ranging from classical linear and logistic regression, through to modern support vector machines, deep learning, boosting, and random forests. There is also no shortage of details on the various approaches and the interested reader can gain further information on any particular method via the innovative companion book wiki. The book does not assume any high-level mathematical or statistical training, or even programming experience, so it should be accessible to almost anyone willing to invest the time to learn about these methods. It should certainly be required reading for anyone starting a PhD program in this area and will serve as a useful reference as they progress further. Finally, the book illustrates some of the algorithms using Python code, one of the most popular coding languages for machine learning. I would highly recommend “The Hundred-Page Machine Learning Book” for both the beginner looking to learn more about machine learning, and the experienced practitioner seeking to extend their knowledge base.
Gareth James, Professor of Data Sciences and Operations at University of Southern California, co-author (with Witten, Hastie and Tibshirani), of the best-selling book An Introduction to Statistical Learning, with Applications in R
Let’s start by telling the truth: machines don’t learn. What a typical “learning machine” does, is finding a mathematical formula, which, when applied to a collection of inputs (called “training data”), produces the desired outputs. This mathematical formula also generates the correct outputs for most other inputs (distinct from the training data) on the condition that those inputs come from the same or a similar statistical distribution as the one the training data was drawn from.
Why isn’t that learning? Because if you slightly distort the inputs, the output is very likely to become completely wrong. It’s not how learning in animals works. If you learned to play a video game by looking straight at the screen, you would still be a good player if someone rotates the screen slightly. A machine learning algorithm, if it was trained by “looking” straight at the screen, unless it was also trained to recognize rotation, will fail to play the game on a rotated screen.
So why the name “machine learning” then? The reason, as is often the case, is marketing: Arthur Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term in 1959 while at IBM. Similarly to how in the 2010s IBM tried to market the term “cognitive computing” to stand out from competition, in the 1960s, IBM used the new cool term “machine learning” to attract both clients and talented employees.
As you can see, just like artificial intelligence is not intelligence, machine learning is not learning. However, machine learning is a universally recognized term that usually refers to the science and engineering of building machines capable of doing various useful things without being explicitly programmed to do so. So, the word “learning” in the term is used by analogy with the learning in animals rather than literally.
This book contains only those parts of the vast body of material on machine learning developed since the 1960s that have proven to have significant practical value. A beginner in machine learning will find in this book just enough details to get a comfortable level of understanding of the field and start asking the right questions.
Practitioners with experience can use this book as a collection of directions for further self-improvement. The book also comes in handy when brainstorming at the beginning of a project, when you try to answer the question whether a given technical or business problem is “machine-learnable” and, if yes, which techniques you should try to solve it.
If you are about to start learning machine learning, you should read this book from the beginning to the end. (It’s just a hundred pages, not a big deal.) If you are interested in a specific topic covered in the book and want to know more, most sections have a QR code.
By scanning one of those QR codes with your phone, you will get a link to a page on the book’s companion wiki theMLbook.com with additional materials: recommended reads, videos, Q&As, code snippets, tutorials, and other bonuses. The book’s wiki is continuously updated with contributions from the book’s author himself as well as volunteers from all over the world. So this book, like a good wine, keeps getting better after you buy it.
Scan the QR code below to get to the book’s wiki:
Some sections don’t have a QR code, but they still most likely have a wiki page. You can find it by submitting the section’s title to the wiki’s search engine.
This book is distributed on the “read first, buy later” principle. I firmly believe that paying for the content before consuming it is buying a pig in a poke. You can see and try a car in a dealership before you buy it. You can try on a shirt or a dress in a department store. You have to be able to read a book before paying for it.
The read first, buy later principle implies that you can freely download the book, read it and share it with your friends and colleagues. Only if you read and liked the book, or found it helpful or useful in any way, you have to buy it.
Now you are all set. Enjoy your reading!
Machine learning is a subfield of computer science that is concerned with building algorithms which, to be useful, rely on a collection of examples of some phenomenon. These examples can come from nature, be handcrafted by humans or generated by another algorithm.
Machine learning can also be defined as the process of solving a practical problem by 1) gathering a dataset, and 2) algorithmically building a statistical model based on that dataset. That statistical model is assumed to be used somehow to solve the practical problem.
To save keystrokes, I use the terms “learning” and “machine learning” interchangeably.
Learning can be supervised, semi-supervised, unsupervised and reinforcement.
In supervised learning¹, the dataset is the collection of labeled examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$. Each element $\mathbf{x}_i$ among $N$ is called a feature vector. A feature vector is a vector in which each dimension $j = 1, \ldots, D$ contains a value that describes the example somehow. That value is called a feature and is denoted as $x^{(j)}$. For instance, if each example $\mathbf{x}$ in our collection represents a person, then the first feature, $x^{(1)}$, could contain height in cm, the second feature, $x^{(2)}$, could contain weight in kg, $x^{(3)}$ could contain gender, and so on. For all examples in the dataset, the feature at position $j$ in the feature vector always contains the same kind of information. It means that if $x_i^{(2)}$ contains weight in kg in some example $\mathbf{x}_i$, then $x_k^{(2)}$ will also contain weight in kg in every example $\mathbf{x}_k$, $k = 1, \ldots, N$. The label $y_i$ can be either an element belonging to a finite set of classes $\{1, 2, \ldots, C\}$, or a real number, or a more complex structure, like a vector, a matrix, a tree, or a graph. Unless otherwise stated, in this book $y_i$ is either one of a finite set of classes or a real number². You can see a class as a category to which an example belongs. For instance, if your examples are email messages and your problem is spam detection, then you have two classes $\{\textit{spam}, \textit{not\_spam}\}$.
The goal of a supervised learning algorithm is to use the dataset to produce a model that takes a feature vector $\mathbf{x}$ as input and outputs information that allows deducing the label for this feature vector. For instance, the model created using the dataset of people could take as input a feature vector describing a person and output a probability that the person has cancer.
In unsupervised learning, the dataset is a collection of unlabeled examples $\{\mathbf{x}_i\}_{i=1}^{N}$. Again, $\mathbf{x}$ is a feature vector, and the goal of an unsupervised learning algorithm is to create a model that takes a feature vector $\mathbf{x}$ as input and either transforms it into another vector or into a value that can be used to solve a practical problem. For example, in clustering, the model returns the id of the cluster for each feature vector in the dataset. In dimensionality reduction, the output of the model is a feature vector that has fewer features than the input $\mathbf{x}$; in outlier detection, the output is a real number that indicates how $\mathbf{x}$ is different from a “typical” example in the dataset.
In semi-supervised learning, the dataset contains both labeled and unlabeled examples. Usually, the quantity of unlabeled examples is much higher than the number of labeled examples. The goal of a semi-supervised learning algorithm is the same as the goal of the supervised learning algorithm. The hope here is that using many unlabeled examples can help the learning algorithm to find (we might say “produce” or “compute”) a better model.
It could look counter-intuitive that learning could benefit from adding more unlabeled examples. It seems like we add more uncertainty to the problem. However, when you add unlabeled examples, you add more information about your problem: a larger sample better reflects the probability distribution from which our labeled data came. Theoretically, a learning algorithm should be able to leverage this additional information.
Reinforcement learning is a subfield of machine learning where the machine “lives” in an environment and is capable of perceiving the state of that environment as a vector of features. The machine can execute actions in every state. Different actions bring different rewards and could also move the machine to another state of the environment. The goal of a reinforcement learning algorithm is to learn a policy.
A policy is a function (similar to the model in supervised learning) that takes the feature vector of a state as input and outputs an optimal action to execute in that state. The action is optimal if it maximizes the expected average reward.
Reinforcement learning solves a particular kind of problem where decision making is sequential, and the goal is long-term, such as game playing, robotics, resource management, or logistics. In this book, I put emphasis on one-shot decision making where input examples are independent of one another and the predictions made in the past. I leave reinforcement learning out of the scope of this book.
In this section, I briefly explain how supervised learning works so that you have the picture of the whole process before we go into detail. I decided to use supervised learning as an example because it’s the type of machine learning most frequently used in practice.
The supervised learning process starts with gathering the data. The data for supervised learning is a collection of pairs (input, output). Input could be anything, for example, email messages, pictures, or sensor measurements. Outputs are usually real numbers, or labels (e.g. “spam”, “not_spam”, “cat”, “dog”, “mouse”, etc). In some cases, outputs are vectors (e.g., four coordinates of the rectangle around a person on the picture), sequences (e.g. [“adjective”, “adjective”, “noun”] for the input “big beautiful car”), or have some other structure.
Let’s say the problem that you want to solve using supervised learning is spam detection. You gather the data, for example, 10,000 email messages, each with a label either “spam” or “not_spam” (you could add those labels manually or pay someone to do that for you). Now, you have to convert each email message into a feature vector.
The data analyst decides, based on their experience, how to convert a real-world entity, such as an email message, into a feature vector. One common way to convert a text into a feature vector, called bag of words, is to take a dictionary of English words (let’s say it contains 20,000 alphabetically sorted words) and stipulate that in our feature vector the first feature is equal to 1 if the email message contains the first word of the dictionary, and to 0 otherwise; the second feature is equal to 1 if it contains the second word of the dictionary, and to 0 otherwise; and so on: the feature at position $j$, for $j = 1, \ldots, 20000$, is equal to 1 if the email message contains the $j$-th word of the dictionary, and to 0 otherwise.
You repeat the above procedure for every email message in our collection, which gives us 10,000 feature vectors (each vector having the dimensionality of 20,000) and a label (“spam”/“not_spam”).
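As a sketch, this bag-of-words conversion fits in a few lines of Python. The vocabulary and messages below are tiny made-up stand-ins for the 20,000-word dictionary and the 10,000 real emails:

```python
# Minimal bag-of-words sketch. The vocabulary and messages are toy
# stand-ins for the 20,000-word dictionary and 10,000 real emails.
vocabulary = sorted(["buy", "cheap", "hello", "meeting", "now"])

def to_feature_vector(message, vocabulary):
    """Binary bag of words: feature j is 1 if word j occurs in the message."""
    words = set(message.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

emails = [
    ("buy cheap now", "spam"),
    ("hello about the meeting", "not_spam"),
]
dataset = [(to_feature_vector(text, vocabulary), label) for text, label in emails]
print(dataset[0])  # ([1, 1, 0, 0, 1], 'spam')
```

In a real pipeline the dictionary would come from the corpus itself, and each of the 10,000 emails would yield one 20,000-dimensional vector.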
Now you have machine-readable input data, but the output labels are still in the form of human-readable text. Some learning algorithms require transforming labels into numbers. For example, some algorithms require numbers like $0$ (to represent the label “not_spam”) and $1$ (to represent the label “spam”). The algorithm I use to illustrate supervised learning is called Support Vector Machine (SVM). This algorithm requires that the positive label (in our case it’s “spam”) has the numeric value of $+1$ (one), and the negative label (“not_spam”) has the value of $-1$ (minus one).
At this point, you have a dataset and a learning algorithm, so you are ready to apply the learning algorithm to the dataset to get the model.
SVM sees every feature vector as a point in a high-dimensional space (in our case, space is 20,000-dimensional). The algorithm puts all feature vectors on an imaginary 20,000-dimensional plot and draws an imaginary 19,999-dimensional line (a hyperplane) that separates examples with positive labels from examples with negative labels. In machine learning, the boundary separating the examples of different classes is called the decision boundary.
The equation of the hyperplane is given by two parameters, a real-valued vector $\mathbf{w}$ of the same dimensionality as our input feature vector $\mathbf{x}$, and a real number $b$, like this:

$$\mathbf{w}\mathbf{x} - b = 0,$$
where the expression $\mathbf{w}\mathbf{x}$ means $w^{(1)}x^{(1)} + w^{(2)}x^{(2)} + \cdots + w^{(D)}x^{(D)}$, and $D$ is the number of dimensions of the feature vector $\mathbf{x}$.
(If some equations aren’t clear to you right now, in Chapter 2 we revisit the math and statistical concepts necessary to understand them. For the moment, try to get an intuition of what’s happening here. It all becomes clearer after you read the next chapter.)
Now, the predicted label for some input feature vector $\mathbf{x}$ is given like this:

$$y = \mathrm{sign}(\mathbf{w}\mathbf{x} - b),$$
where $\mathrm{sign}$ is a mathematical operator that takes any value as input and returns $+1$ if the input is a positive number or $-1$ if the input is a negative number.
The goal of the learning algorithm — SVM in this case — is to leverage the dataset and find the optimal values $\mathbf{w}^*$ and $b^*$ for parameters $\mathbf{w}$ and $b$. Once the learning algorithm identifies these optimal values, the model $f(\mathbf{x})$ is then defined as:

$$f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^*\mathbf{x} - b^*).$$
Therefore, to predict whether an email message is spam or not spam using an SVM model, you have to take the text of the message, convert it into a feature vector, then multiply this vector by $\mathbf{w}^*$, subtract $b^*$ and take the sign of the result. This will give us the prediction ($+1$ means “spam”, $-1$ means “not_spam”).
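The prediction step just described is one line of arithmetic. A sketch with NumPy, where the parameter values are made up for a three-dimensional toy problem (a real spam model would have 20,000 dimensions and learned parameters):

```python
import numpy as np

# Hypothetical learned parameters for a 3-dimensional toy problem.
w_star = np.array([0.5, -1.2, 2.0])
b_star = 0.4

def predict(x, w, b):
    """Return +1 ("spam") or -1 ("not_spam") for feature vector x."""
    return int(np.sign(np.dot(w, x) - b))

x = np.array([1.0, 0.0, 1.0])
print(predict(x, w_star, b_star))  # 1, i.e. "spam"
```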
Now, how does the machine find $\mathbf{w}^*$ and $b^*$? It solves an optimization problem. Machines are good at optimizing functions under constraints.
So what are the constraints we want to satisfy here? First of all, we want the model to predict the labels of our 10,000 examples correctly. Remember that each example $i = 1, \ldots, 10000$ is given by a pair $(\mathbf{x}_i, y_i)$, where $\mathbf{x}_i$ is the feature vector of example $i$ and $y_i$ is its label that takes values either $-1$ or $+1$. So the constraints are naturally:

$$\mathbf{w}\mathbf{x}_i - b \ge +1 \;\text{ if } y_i = +1, \quad \text{and} \quad \mathbf{w}\mathbf{x}_i - b \le -1 \;\text{ if } y_i = -1.$$
We would also prefer that the hyperplane separates positive examples from negative ones with the largest margin. The margin is the distance between the closest examples of two classes, as defined by the decision boundary. A large margin contributes to a better generalization, that is how well the model will classify new examples in the future. To achieve that, we need to minimize the Euclidean norm of $\mathbf{w}$, denoted by $\|\mathbf{w}\|$ and given by $\sqrt{\sum_{j=1}^{D} \left(w^{(j)}\right)^2}$.
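The Euclidean norm itself is easy to compute. A quick NumPy sketch, on a made-up two-dimensional vector, checking the explicit formula against the library routine:

```python
import numpy as np

w = np.array([3.0, 4.0])
norm_explicit = np.sqrt(np.sum(w ** 2))  # sqrt(3^2 + 4^2) = 5.0
norm_library = np.linalg.norm(w)         # same Euclidean norm
print(norm_explicit, norm_library)  # 5.0 5.0
```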
So, the optimization problem that we want the machine to solve looks like this:
Minimize $\|\mathbf{w}\|$ subject to $y_i(\mathbf{w}\mathbf{x}_i - b) \ge 1$ for $i = 1, \ldots, N$. The expression $y_i(\mathbf{w}\mathbf{x}_i - b) \ge 1$ is just a compact way to write the above two constraints.
The solution of this optimization problem, given by $\mathbf{w}^*$ and $b^*$, is called the statistical model, or, simply, the model. The process of building the model is called training.
For two-dimensional feature vectors, the problem and the solution can be visualized as shown in fig. 1. The blue and orange circles represent, respectively, positive and negative examples, and the line given by $\mathbf{w}\mathbf{x} - b = 0$ is the decision boundary.
Why, by minimizing the norm of $\mathbf{w}$, do we find the highest margin between the two classes? Geometrically, the equations $\mathbf{w}\mathbf{x} - b = 1$ and $\mathbf{w}\mathbf{x} - b = -1$ define two parallel hyperplanes, as you see in fig. 1. The distance between these hyperplanes is given by $\frac{2}{\|\mathbf{w}\|}$, so the smaller the norm $\|\mathbf{w}\|$, the larger the distance between these two hyperplanes.
That’s how Support Vector Machines work. This particular version of the algorithm builds the so-called linear model. It’s called linear because the decision boundary is a straight line (or a plane, or a hyperplane). SVM can also incorporate kernels that can make the decision boundary arbitrarily non-linear. In some cases, it could be impossible to perfectly separate the two groups of points because of noise in the data, errors of labeling, or outliers (examples very different from a “typical” example in the dataset). Another version of SVM can also incorporate a penalty hyperparameter³ for misclassification of training examples of specific classes. We study the SVM algorithm in more detail in Chapter 3.
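As a sketch of the whole procedure, here is a simplified trainer in NumPy. It uses stochastic sub-gradient descent on the soft-margin SVM objective (a Pegasos-style method, not the exact solver discussed above), a made-up two-dimensional dataset, and the common trick of appending a constant feature so the bias $b$ is learned as an extra weight; it then reads off $\mathbf{w}^*$, $b^*$, and the margin $2/\|\mathbf{w}^*\|$:

```python
import numpy as np

# Toy 2-D dataset: two linearly separable clusters, labels +1 and -1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -3.0], [-2.5, -3.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# Append a constant feature so the bias is learned as the last weight:
# w_ext . [x, 1] = w.x - b, with b = -w_ext[-1].
X_ext = np.hstack([X, np.ones((len(X), 1))])

# Pegasos-style stochastic sub-gradient descent on the soft-margin objective.
rng = np.random.default_rng(0)
lam = 0.01                      # regularization strength (assumed value)
w_ext = np.zeros(3)
for t in range(1, 5001):
    i = rng.integers(len(X_ext))
    eta = 1.0 / (lam * t)       # decaying learning rate
    if y[i] * (w_ext @ X_ext[i]) < 1:   # example inside the margin: hinge active
        w_ext = (1 - eta * lam) * w_ext + eta * y[i] * X_ext[i]
    else:                                # only the regularizer pulls w toward 0
        w_ext = (1 - eta * lam) * w_ext

w_star, b_star = w_ext[:2], -w_ext[2]
margin = 2.0 / np.linalg.norm(w_star)

def predict(x):
    return int(np.sign(w_star @ x - b_star))

print("w* =", w_star, " b* =", round(b_star, 3), " margin =", round(margin, 2))
```

With enough iterations on separable data, the learned boundary approaches the maximum-margin one; scikit-learn's `SVC(kernel="linear")` solves the same problem with a proper solver.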
At this point, you should retain the following: any classification learning algorithm that builds a model implicitly or explicitly creates a decision boundary. The decision boundary can be straight, or curved, or it can have a complex form, or it can be a superposition of some geometrical figures. The form of the decision boundary determines the accuracy of the model (that is the ratio of examples whose labels are predicted correctly). The form of the decision boundary, the way it is algorithmically or mathematically computed based on the training data, differentiates one learning algorithm from another.
In practice, there are two other essential differentiators of learning algorithms to consider: speed of model building and prediction processing time. In many practical cases, you would prefer a learning algorithm that builds a less accurate model quickly. Additionally, you might prefer a less accurate model that is much quicker at making predictions.
Why is a machine-learned model capable of predicting correctly the labels of new, previously unseen examples? To understand that, look at the plot in fig. 1. If two classes are separable from one another by a decision boundary, then, obviously, examples that belong to each class are located in two different subspaces which the decision boundary creates.
If the examples used for training were selected randomly, independently of one another, and following the same procedure, then, statistically, it is more likely that the new negative example will be located on the plot somewhere not too far from other negative examples. The same concerns the new positive example: it will likely come from the surroundings of other positive examples. In such a case, our decision boundary will still, with high probability, separate well new positive and negative examples from one another. For other, less likely situations, our model will make errors, but because such situations are less likely, the number of errors will likely be smaller than the number of correct predictions.
Intuitively, the larger the set of training examples, the less likely it is that the new examples will be dissimilar to (and lie on the plot far from) the examples used for training.
To minimize the probability of making errors on new examples, the SVM algorithm, by looking for the largest margin, explicitly tries to draw the decision boundary in such a way that it lies as far as possible from examples of both classes.
The reader interested in knowing more about learnability and understanding the close relationship between the model error, the size of the training set, the form of the mathematical equation that defines the model, and the time it takes to build the model is encouraged to read about PAC learning. The PAC (“probably approximately correct”) learning theory helps to analyze whether and under what conditions a learning algorithm will probably output an approximately correct classifier.
If an expression is in bold, that means that this is a technical or scientific term. If you meet it once again in the book, the term will have exactly the same meaning.↩
A real number is a quantity that can represent a distance along a line. Examples: $0$, $-3$, $2.5$, $\sqrt{2}$.↩
A hyperparameter is a property of a learning algorithm, usually (but not always) having a numerical value. That value influences the way the algorithm works. Those values aren’t learned by the algorithm itself from data. They have to be set by the data analyst before running the algorithm.↩
Let’s start by revisiting the mathematical notation we all learned at school, but some likely forgot right after the prom.
A scalar is a simple numerical value, like $15$ or $-3.25$. Variables or constants that take scalar values are denoted by an italic letter, like $x$ or $a$.
A vector is an ordered list of scalar values, called attributes. We denote a vector as a bold character, for example, $\mathbf{x}$ or $\mathbf{w}$. Vectors can be visualized as arrows that point to some directions as well as points in a multi-dimensional space. Illustrations of three two-dimensional vectors, $\mathbf{a} = [2, 3]$, $\mathbf{b} = [-2, 5]$, and $\mathbf{c} = [1, 0]$, are given in fig. 2 and fig. 3. We denote an attribute of a vector as an italic value with an index, like this: $w^{(j)}$ or $x^{(j)}$. The index $j$ denotes a specific dimension of the vector, the position of an attribute in the list. For instance, in the vector $\mathbf{a}$ shown in red in fig. 2 and fig. 3, $a^{(1)} = 2$ and $a^{(2)} = 3$.
The notation $x^{(j)}$ should not be confused with the power operator, such as the $2$ in $x^2$ (squared) or the $3$ in $x^3$ (cubed). If we want to apply a power operator, say square, to an indexed attribute of a vector, we write it like this: $\left(x^{(j)}\right)^2$.
A variable can have two or more indices, like this: $x_i^{(j)}$ or like this: $x_{i,j}^{(k)}$. For example, in neural networks, we denote as $x_{l,u}^{(j)}$ the input feature $j$ of unit $u$ in layer $l$.
A matrix is a rectangular array of numbers arranged in rows and columns. Below is an example of a matrix with two rows and three columns,

$$\mathbf{A} = \begin{bmatrix} 2 & 4 & -3 \\ 21 & -6 & -1 \end{bmatrix}.$$
Matrices are denoted with bold capital letters, such as $\mathbf{A}$ or $\mathbf{W}$.
A set is an unordered collection of unique elements. We denote a set as a calligraphic capital character, for example, $\mathcal{S}$. A set of numbers can be finite (include a fixed amount of values). In this case, it is denoted using accolades, for example, $\{1, 3, 18, 23, 235\}$ or $\{x_1, x_2, x_3, x_4, \ldots, x_n\}$. A set can be infinite and include all values in some interval. If a set includes all values between $a$ and $b$, including $a$ and $b$, it is denoted using brackets as $[a, b]$. If the set doesn't include the values $a$ and $b$, such a set is denoted using parentheses like this: $(a, b)$. For example, the set $[0, 1]$ includes such values as $0$, $0.0001$, $0.25$, $0.784$, $0.9995$, and $1.0$. A special set denoted $\mathbb{R}$ includes all numbers from minus infinity to plus infinity.
When an element $x$ belongs to a set $\mathcal{S}$, we write $x \in \mathcal{S}$. We can obtain a new set $\mathcal{S}_3$ as an intersection of two sets $\mathcal{S}_1$ and $\mathcal{S}_2$. In this case, we write $\mathcal{S}_3 \leftarrow \mathcal{S}_1 \cap \mathcal{S}_2$. For example, $\{1, 3, 5, 8\} \cap \{1, 8, 4\}$ gives the new set $\{1, 8\}$.
We can obtain a new set $\mathcal{S}_3$ as a union of two sets $\mathcal{S}_1$ and $\mathcal{S}_2$. In this case, we write $\mathcal{S}_3 \leftarrow \mathcal{S}_1 \cup \mathcal{S}_2$. For example, $\{1, 3, 5, 8\} \cup \{1, 8, 4\}$ gives the new set $\{1, 3, 4, 5, 8\}$.
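These two operations map directly onto Python's built-in set type; a quick sketch using the example values above:

```python
# Set intersection (S1 ∩ S2) and union (S1 ∪ S2) with Python sets.
s1 = {1, 3, 5, 8}
s2 = {1, 8, 4}

intersection = s1 & s2  # elements present in both sets: {1, 8}
union = s1 | s2         # elements present in either set: {1, 3, 4, 5, 8}
```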
The summation over a collection $\mathcal{X} = \{x_1, x_2, \ldots, x_{n-1}, x_n\}$ or over the attributes of a vector $\mathbf{x} = [x^{(1)}, x^{(2)}, \ldots, x^{(m-1)}, x^{(m)}]$ is denoted like this:

$$\sum_{i=1}^{n} x_i \stackrel{\text{def}}{=} x_1 + x_2 + \ldots + x_{n-1} + x_n,$$
or else:

$$\sum_{j=1}^{m} x^{(j)} \stackrel{\text{def}}{=} x^{(1)} + x^{(2)} + \ldots + x^{(m-1)} + x^{(m)}.$$
The notation $\stackrel{\text{def}}{=}$ means "is defined as".
A notation analogous to capital sigma is the capital pi notation. It denotes a product of elements in a collection or attributes of a vector:

$$\prod_{i=1}^{n} x_i \stackrel{\text{def}}{=} x_1 \cdot x_2 \cdot \ldots \cdot x_{n-1} \cdot x_n,$$
where $a \cdot b$ means $a$ multiplied by $b$. Where possible, we omit $\cdot$ to simplify the notation, so $ab$ also means $a$ multiplied by $b$.
A derived set creation operator looks like this: $\mathcal{S}' \leftarrow \{x^2 \mid x \in \mathcal{S}, x > 3\}$. This notation means that we create a new set $\mathcal{S}'$ by putting into it $x$ squared such that $x$ is in $\mathcal{S}$, and $x$ is greater than $3$.
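Python's set comprehensions are a near-literal transcription of this notation; an illustrative sketch (the example set is mine):

```python
# {x² | x ∈ S, x > 3} as a Python set comprehension.
s = {1, 2, 3, 4, 5, 6}
s_prime = {x ** 2 for x in s if x > 3}  # {16, 25, 36}
```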
The cardinality operator $|\mathcal{S}|$ returns the number of elements in set $\mathcal{S}$.
The sum of two vectors $\mathbf{x} + \mathbf{z}$ is defined as the vector $[x^{(1)} + z^{(1)}, x^{(2)} + z^{(2)}, \ldots, x^{(m)} + z^{(m)}]$.
The difference of two vectors $\mathbf{x} - \mathbf{z}$ is defined as $[x^{(1)} - z^{(1)}, x^{(2)} - z^{(2)}, \ldots, x^{(m)} - z^{(m)}]$.
A vector multiplied by a scalar is a vector. For example $\mathbf{x}c \stackrel{\text{def}}{=} [cx^{(1)}, cx^{(2)}, \ldots, cx^{(m)}]$.
A dot-product of two vectors is a scalar. For example, $\mathbf{w}\mathbf{x} \stackrel{\text{def}}{=} \sum_{i=1}^{m} w^{(i)} x^{(i)}$. In some books, the dot-product is denoted as $\mathbf{w} \cdot \mathbf{x}$. The two vectors must be of the same dimensionality. Otherwise, the dot-product is undefined.
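As a quick sketch (the example vectors are illustrative, not from the text), the dot-product is a sum of element-wise products:

```python
def dot(w, x):
    # Dot-product of two vectors: sum of element-wise products.
    # Undefined (here: an error) if the dimensionalities differ.
    if len(w) != len(x):
        raise ValueError("vectors must have the same dimensionality")
    return sum(wi * xi for wi, xi in zip(w, x))

dot([2, 3], [4, -1])  # 2*4 + 3*(-1) = 5
```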
The multiplication of a matrix $\mathbf{W}$ by a vector $\mathbf{x}$ results in another vector. Let our matrix be,

$$\mathbf{W} = \begin{bmatrix} w^{(1,1)} & w^{(1,2)} & w^{(1,3)} \\ w^{(2,1)} & w^{(2,2)} & w^{(2,3)} \end{bmatrix}.$$
When vectors participate in operations on matrices, a vector is by default represented as a matrix with one column. When the vector is on the right of the matrix, it remains a column vector. We can only multiply a matrix by a vector if the vector has the same number of rows as the number of columns in the matrix. Let our vector be $\mathbf{x} \stackrel{\text{def}}{=} [x^{(1)}, x^{(2)}, x^{(3)}]$. Then $\mathbf{W}\mathbf{x}$ is a two-dimensional vector defined as,

$$\mathbf{W}\mathbf{x} = \begin{bmatrix} w^{(1,1)} & w^{(1,2)} & w^{(1,3)} \\ w^{(2,1)} & w^{(2,2)} & w^{(2,3)} \end{bmatrix} \begin{bmatrix} x^{(1)} \\ x^{(2)} \\ x^{(3)} \end{bmatrix} \stackrel{\text{def}}{=} \begin{bmatrix} w^{(1,1)}x^{(1)} + w^{(1,2)}x^{(2)} + w^{(1,3)}x^{(3)} \\ w^{(2,1)}x^{(1)} + w^{(2,2)}x^{(2)} + w^{(2,3)}x^{(3)} \end{bmatrix}.$$
If our matrix had, say, five rows, the result of the product would be a five-dimensional vector.
When the vector is on the left side of the matrix in the multiplication, then it has to be transposed before we multiply it by the matrix. The transpose of the vector $\mathbf{x}$, denoted as $\mathbf{x}^\top$, makes a row vector out of a column vector. Let's say,

$$\mathbf{x} = \begin{bmatrix} x^{(1)} \\ x^{(2)} \end{bmatrix}, \text{ then } \mathbf{x}^\top = \begin{bmatrix} x^{(1)} & x^{(2)} \end{bmatrix}.$$
The multiplication of the vector $\mathbf{x}$ by the matrix $\mathbf{W}$ is given by $\mathbf{x}^\top\mathbf{W}$,

$$\mathbf{x}^\top\mathbf{W} = \begin{bmatrix} x^{(1)} & x^{(2)} \end{bmatrix} \begin{bmatrix} w^{(1,1)} & w^{(1,2)} & w^{(1,3)} \\ w^{(2,1)} & w^{(2,2)} & w^{(2,3)} \end{bmatrix} \stackrel{\text{def}}{=} \begin{bmatrix} w^{(1,1)}x^{(1)} + w^{(2,1)}x^{(2)} & w^{(1,2)}x^{(1)} + w^{(2,2)}x^{(2)} & w^{(1,3)}x^{(1)} + w^{(2,3)}x^{(2)} \end{bmatrix}.$$
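Both products above can be sketched in a few lines of Python (the concrete numbers are illustrative, reusing the 2-by-3 matrix example):

```python
def mat_vec(W, x):
    # W x with x as a column vector: one dot-product per row of W.
    return [sum(w_rc * x_c for w_rc, x_c in zip(row, x)) for row in W]

def vec_mat(x, W):
    # xᵀ W: the transposed (row) vector times W, one dot-product per column of W.
    return [sum(x_r * w_rc for x_r, w_rc in zip(x, col)) for col in zip(*W)]

W = [[2, 4, -3],
     [21, -6, -1]]
mat_vec(W, [1, 2, 3])   # [2+8-9, 21-12-3] = [1, 6]
vec_mat([1, 2], W)      # [2+42, 4-12, -3-2] = [44, -8, -5]
```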
As you can see, we can only multiply a vector by a matrix if the vector has the same number of dimensions as the number of rows in the matrix.
A function is a relation that associates each element $x$ of a set $\mathcal{X}$, the domain of the function, to a single element $y$ of another set $\mathcal{Y}$, the codomain of the function. A function usually has a name. If the function is called $f$, this relation is denoted $y = f(x)$ (read $f$ of $x$), the element $x$ is the argument or input of the function, and $y$ is the value of the function or the output. The symbol that is used for representing the input is the variable of the function (we often say that $f$ is a function of the variable $x$).
We say that $f(x)$ has a local minimum at $x = c$ if $f(x) \geq f(c)$ for every $x$ in some open interval around $x = c$. An interval is a set of real numbers with the property that any number that lies between two numbers in the set is also included in the set. An open interval does not include its endpoints and is denoted using parentheses. For example, $(0, 1)$ means "all numbers greater than $0$ and less than $1$". The minimal value among all the local minima is called the global minimum. See the illustration in fig. 4.
A vector function, denoted as $\mathbf{y} = \mathbf{f}(x)$, is a function that returns a vector $\mathbf{y}$. It can have a vector or a scalar argument.
Given a set of values $\mathcal{A} = \{a_1, a_2, \ldots, a_n\}$, the operator $\max_{a \in \mathcal{A}} f(a)$ returns the highest value $f(a)$ for all elements in the set $\mathcal{A}$. On the other hand, the operator $\arg\max_{a \in \mathcal{A}} f(a)$ returns the element of the set $\mathcal{A}$ that maximizes $f(a)$.
Sometimes, when the set is implicit or infinite, we can write $\max_a f(a)$ or $\arg\max_a f(a)$.
Operators $\min$ and $\arg\min$ operate in a similar manner.
The expression $a \leftarrow f(x)$ means that the variable $a$ gets the new value: the result of $f(x)$. We say that the variable $a$ gets assigned a new value. Similarly, $\mathbf{a} \leftarrow [a_1, a_2]$ means that the vector variable $\mathbf{a}$ gets the two-dimensional vector value $[a_1, a_2]$.
A derivative $f'$ of a function $f$ is a function or a value that describes how fast $f$ grows (or decreases). If the derivative is a constant value, like $5$ or $-3$, then the function grows (or decreases) constantly at any point $x$ of its domain. If the derivative $f'$ is a function, then the function $f$ can grow at a different pace in different regions of its domain. If the derivative $f'$ is positive at some point $x$, then the function $f$ grows at this point. If the derivative of $f$ is negative at some $x$, then the function decreases at this point. The derivative of zero at $x$ means that the function's slope at $x$ is horizontal.
The process of finding a derivative is called differentiation.
Derivatives for basic functions are known. For example if $f(x) = x^2$, then $f'(x) = 2x$; if $f(x) = 2x$ then $f'(x) = 2$; if $f(x) = 2$ then $f'(x) = 0$ (the derivative of any function $f(x) = c$, where $c$ is a constant value, is zero).
If the function we want to differentiate is not basic, we can find its derivative using the chain rule. For instance if $F(x) = f(g(x))$, where $f$ and $g$ are some functions, then $F'(x) = f'(g(x))g'(x)$. For example if $F(x) = (5x + 1)^2$ then $g(x) = 5x + 1$ and $f(g(x)) = (g(x))^2$. By applying the chain rule, we find $F'(x) = 2(5x + 1)g'(x) = 2(5x + 1) \cdot 5 = 50x + 10$.
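The chain-rule result for this example can be checked numerically with a central finite difference; a short sketch:

```python
def F(x):
    # F(x) = (5x + 1)^2, the composite function from the example.
    return (5 * x + 1) ** 2

def F_prime(x):
    # Chain rule: f(z) = z^2, g(x) = 5x + 1, so F'(x) = 2(5x + 1) * 5.
    return 2 * (5 * x + 1) * 5

# Central finite difference at x0 should agree with the analytic derivative.
h = 1e-6
x0 = 2.0
numeric = (F(x0 + h) - F(x0 - h)) / (2 * h)  # ≈ F_prime(2.0) = 110
```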
Gradient is the generalization of derivative for functions that take several inputs (or one input in the form of a vector or some other complex structure). A gradient of a function is a vector of partial derivatives. You can look at finding a partial derivative of a function as the process of finding the derivative by focusing on one of the function’s inputs and by considering all other inputs as constant values.
For example, if our function is defined as $f([x^{(1)}, x^{(2)}]) = ax^{(1)} + bx^{(2)} + c$, then the partial derivative of function $f$ with respect to $x^{(1)}$, denoted as $\frac{\partial f}{\partial x^{(1)}}$, is given by,

$$\frac{\partial f}{\partial x^{(1)}} = a + 0 + 0 = a,$$
where $a$ is the derivative of the function $ax^{(1)}$; the two zeroes are respectively derivatives of $bx^{(2)}$ and $c$, because $x^{(2)}$ is considered constant when we compute the derivative with respect to $x^{(1)}$, and the derivative of any constant is zero.
Similarly, the partial derivative of function $f$ with respect to $x^{(2)}$, $\frac{\partial f}{\partial x^{(2)}}$, is given by,

$$\frac{\partial f}{\partial x^{(2)}} = 0 + b + 0 = b.$$
The gradient of function $f$, denoted as $\nabla f$, is given by the vector $\left[\frac{\partial f}{\partial x^{(1)}}, \frac{\partial f}{\partial x^{(2)}}\right]$.
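The "hold all other inputs constant" view of partial derivatives can be sketched with finite differences; for the linear example above the gradient should come out as approximately $[a, b]$ (the constants are illustrative):

```python
def f(x, a=3.0, b=-2.0, c=1.0):
    # f([x1, x2]) = a*x1 + b*x2 + c, the example function from the text
    # with illustrative constants a=3, b=-2, c=1.
    return a * x[0] + b * x[1] + c

def partial(func, x, i, h=1e-6):
    # Approximate ∂f/∂x^(i): vary input i, keep all other inputs fixed.
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    return (func(xp) - func(xm)) / (2 * h)

grad = [partial(f, [1.0, 2.0], i) for i in range(2)]  # ≈ [3.0, -2.0]
```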
The chain rule works with partial derivatives too, as I illustrate in Chapter 4.
A random variable, usually written as an italic capital letter, like $X$, is a variable whose possible values are numerical outcomes of a random phenomenon. Examples of random phenomena with a numerical outcome include a toss of a coin ($0$ for heads and $1$ for tails), a roll of a dice, or the height of the first stranger you meet outside. There are two types of random variables: discrete and continuous.
A discrete random variable takes on only a countable number of distinct values such as red, yellow, blue or $1$, $2$, $3$, $\ldots$.
The probability distribution of a discrete random variable is described by a list of probabilities associated with each of its possible values. This list of probabilities is called a probability mass function (pmf). For example: $\Pr(X = \text{red}) = 0.3$, $\Pr(X = \text{yellow}) = 0.45$, $\Pr(X = \text{blue}) = 0.25$. Each probability in a probability mass function is a value greater than or equal to $0$. The sum of probabilities equals $1$ (fig. 5).
A continuous random variable (CRV) takes an infinite number of possible values in some interval. Examples include height, weight, and time. Because the number of values of a continuous random variable is infinite, the probability $\Pr(X = c)$ for any $c$ is $0$. Therefore, instead of the list of probabilities, the probability distribution of a CRV (a continuous probability distribution) is described by a probability density function (pdf). The pdf is a function $f_X$ whose codomain is nonnegative and the area under its curve is equal to $1$ (fig. 6).
Let a discrete random variable $X$ have $k$ possible values $\{x_i\}_{i=1}^{k}$. The expectation of $X$ denoted as $\mathbb{E}[X]$ is given by,

$$\mathbb{E}[X] \stackrel{\text{def}}{=} \sum_{i=1}^{k} x_i \Pr(X = x_i), \quad (1)$$
where $\Pr(X = x_i)$ is the probability that $X$ has the value $x_i$ according to the pmf. The expectation of a random variable is also called the mean, average or expected value and is frequently denoted with the letter $\mu$. The expectation is one of the most important statistics of a random variable.
Another important statistic is the standard deviation, defined as,

$$\sigma \stackrel{\text{def}}{=} \sqrt{\mathbb{E}\!\left[(X - \mu)^2\right]}.$$
Variance, denoted as $\sigma^2$ or $\mathrm{var}(X)$, is defined as,

$$\sigma^2 = \mathbb{E}\!\left[(X - \mu)^2\right].$$
For a discrete random variable, the standard deviation is given by:

$$\sigma = \sqrt{\Pr(X = x_1)(x_1 - \mu)^2 + \Pr(X = x_2)(x_2 - \mu)^2 + \ldots + \Pr(X = x_k)(x_k - \mu)^2},$$

where $\mu = \mathbb{E}[X]$.
The expectation of a continuous random variable $X$ is given by,

$$\mathbb{E}[X] \stackrel{\text{def}}{=} \int_{\mathbb{R}} x f_X(x)\,dx, \quad (2)$$
where $f_X$ is the pdf of the variable $X$ and $\int_{\mathbb{R}}$ is the integral of the function $x f_X$.
Integral is an equivalent of the summation over all values of the function when the function has a continuous domain. It equals the area under the curve of the function. The property of the pdf that the area under its curve is $1$ mathematically means that $\int_{\mathbb{R}} f_X(x)\,dx = 1$.
Most of the time we don't know $f_X$, but we can observe some values of $X$. In machine learning, we call these values examples, and the collection of these examples is called a sample or a dataset.
Because $f_X$ is usually unknown, but we have a sample $S_X = \{x_i\}_{i=1}^{N}$, we often content ourselves not with the true values of statistics of the probability distribution, such as expectation, but with their unbiased estimators.
We say that $\hat{\theta}(S_X)$ is an unbiased estimator of some statistic $\theta$ calculated using a sample $S_X$ drawn from an unknown probability distribution if $\hat{\theta}(S_X)$ has the following property:

$$\mathbb{E}\!\left[\hat{\theta}(S_X)\right] = \theta,$$
where $\hat{\theta}$ is a sample statistic, obtained using a sample $S_X$ and not the real statistic $\theta$ that can be obtained only knowing $X$; the expectation is taken over all possible samples drawn from $X$. Intuitively, this means that if you can have an unlimited number of such samples as $S_X$, and you compute some unbiased estimator, such as $\hat{\mu}$, using each sample, then the average of all these $\hat{\mu}$ equals the real statistic $\mu$ that you would get computed on $X$.
It can be shown that an unbiased estimator of an unknown $\mathbb{E}[X]$ (given by either eq. 1 or eq. 2) is given by $\frac{1}{N}\sum_{i=1}^{N} x_i$ (called in statistics the sample mean).
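The unbiasedness intuition can be checked by simulation: draw many samples, compute a sample mean for each, and average them; the average approaches the true expectation. A sketch with an assumed Gaussian data source:

```python
import random

random.seed(0)
true_mu = 2.7  # true expectation of the (assumed) data-generating distribution

# Draw many samples, compute the sample mean of each, and average those means.
sample_means = []
for _ in range(2000):
    sample = [random.gauss(true_mu, 1.0) for _ in range(10)]
    sample_means.append(sum(sample) / len(sample))

estimate = sum(sample_means) / len(sample_means)  # ≈ true_mu
```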
The conditional probability $\Pr(X = x \mid Y = y)$ is the probability of the random variable $X$ to have a specific value $x$ given that another random variable $Y$ has a specific value of $y$. The Bayes' Rule (also known as the Bayes' Theorem) stipulates that:

$$\Pr(X = x \mid Y = y) = \frac{\Pr(Y = y \mid X = x)\Pr(X = x)}{\Pr(Y = y)}.$$
Bayes' Rule comes in handy when we have a model of $X$'s distribution, and this model $f_\theta$ is a function that has some parameters in the form of a vector $\theta$. An example of such a function could be the Gaussian function that has two parameters, $\mu$ and $\sigma$, and is defined as:

$$f_\theta(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad (3)$$
where $\theta \stackrel{\text{def}}{=} [\mu, \sigma]$ and $\pi$ is the constant ($3.14159\ldots$).
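Eq. 3 translates directly into code; a minimal sketch of the Gaussian pdf:

```python
import math

def gaussian_pdf(x, mu, sigma):
    # f_theta(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2*sigma^2))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

gaussian_pdf(0.0, 0.0, 1.0)  # peak of the standard normal, ≈ 0.3989
```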
This function has all the properties of a pdf1. Therefore, we can use it as a model of an unknown distribution of $X$. We can update the values of parameters in the vector $\theta$ from the data using the Bayes' Rule:

$$\Pr(\theta = \hat{\theta} \mid X = x) \leftarrow \frac{\Pr(X = x \mid \theta = \hat{\theta})\Pr(\theta = \hat{\theta})}{\Pr(X = x)} = \frac{\Pr(X = x \mid \theta = \hat{\theta})\Pr(\theta = \hat{\theta})}{\sum_{\tilde{\theta}} \Pr(X = x \mid \theta = \tilde{\theta})\Pr(\theta = \tilde{\theta})}, \quad (4)$$
where $\Pr(X = x \mid \theta = \hat{\theta}) \stackrel{\text{def}}{=} f_{\hat{\theta}}(x)$.
If we have a sample $\mathcal{S}$ of $X$ and the set of possible values for $\theta$ is finite, we can easily estimate $\Pr(\theta = \hat{\theta})$ by applying Bayes' Rule iteratively, one example $x \in \mathcal{S}$ at a time. The initial value $\Pr(\theta = \hat{\theta})$ can be guessed such that $\sum_{\hat{\theta}} \Pr(\theta = \hat{\theta}) = 1$. This guess of the probabilities for different $\hat{\theta}$ is called the prior.
First, we compute $\Pr(\theta = \hat{\theta} \mid X = x_1)$ for all possible values $\hat{\theta}$. Then, before updating $\Pr(\theta = \hat{\theta} \mid X = x)$ once again, this time for $x = x_2$ using eq. 4, we replace the prior $\Pr(\theta = \hat{\theta})$ in eq. 4 by the new estimate $\Pr(\theta = \hat{\theta}) \leftarrow \frac{1}{N}\sum_{i=1}^{N} \Pr(\theta = \hat{\theta} \mid X = x_i)$.
The best value of the parameters $\theta^*$ given one example is obtained using the principle of maximum a posteriori (or MAP):

$$\theta^* = \arg\max_{\theta} \prod_{i=1}^{N} \Pr(\theta = \hat{\theta} \mid X = x_i). \quad (5)$$
If the set of possible values for $\theta$ isn't finite, then we need to optimize eq. 5 directly using a numerical optimization routine, such as gradient descent, which we consider in Chapter 4. Usually, we optimize the natural logarithm of the right-hand side expression in eq. 5 because the logarithm of a product becomes the sum of logarithms and it's easier for the machine to work with a sum than with a product2.
A hyperparameter is a property of a learning algorithm, usually (but not always) having a numerical value. That value influences the way the algorithm works. Hyperparameters aren’t learned by the algorithm itself from data. They have to be set by the data analyst before running the algorithm. I show how to do that in Chapter 5.
Parameters are variables that define the model learned by the learning algorithm. Parameters are directly modified by the learning algorithm based on the training data. The goal of learning is to find such values of parameters that make the model optimal in a certain sense.
Classification is a problem of automatically assigning a label to an unlabeled example. Spam detection is a famous example of classification.
In machine learning, the classification problem is solved by a classification learning algorithm that takes a collection of labeled examples as inputs and produces a model that can take an unlabeled example as input and either directly output a label or output a number that can be used by the analyst to deduce the label. An example of such a number is a probability.
In a classification problem, a label is a member of a finite set of classes. If the size of the set of classes is two (“sick”/“healthy”, “spam”/“not_spam”), we talk about binary classification (also called binomial in some sources). Multiclass classification (also called multinomial) is a classification problem with three or more classes3.
While some learning algorithms naturally allow for more than two classes, others are by nature binary classification algorithms. There are strategies that allow turning a binary classification learning algorithm into a multiclass one. I talk about one of them in Chapter 7.
Regression is a problem of predicting a real-valued label (often called a target) given an unlabeled example. Estimating house price valuation based on house features, such as area, the number of bedrooms, location and so on is a famous example of regression.
The regression problem is solved by a regression learning algorithm that takes a collection of labeled examples as inputs and produces a model that can take an unlabeled example as input and output a target.
Most supervised learning algorithms are model-based. We have already seen one such algorithm: SVM. Model-based learning algorithms use the training data to create a model that has parameters learned from the training data. In SVM, the two parameters we saw were $\mathbf{w}^*$ and $b^*$. After the model was built, the training data can be discarded.
Instance-based learning algorithms use the whole dataset as the model. One instance-based algorithm frequently used in practice is k-Nearest Neighbors (kNN). In classification, to predict a label for an input example the kNN algorithm looks at the close neighborhood of the input example in the space of feature vectors and outputs the label that it saw the most often in this close neighborhood.
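The kNN classification step can be sketched in a few lines (the training examples and labels here are illustrative, not from the book):

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    # train: list of (feature_vector, label) pairs; the dataset IS the model.
    # Sort the training examples by Euclidean distance to x and take the
    # majority label among the k nearest ones.
    by_dist = sorted(train, key=lambda ex: math.dist(ex[0], x))
    labels = [label for _, label in by_dist[:k]]
    return Counter(labels).most_common(1)[0][0]

train = [([0, 0], "a"), ([0, 1], "a"), ([1, 0], "a"),
         ([5, 5], "b"), ([6, 5], "b")]
knn_predict(train, [0.5, 0.5])  # "a": all three nearest neighbors are "a"
```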
A shallow learning algorithm learns the parameters of the model directly from the features of the training examples. Most supervised learning algorithms are shallow. The notorious exceptions are neural network learning algorithms, specifically those that build neural networks with more than one layer between input and output. Such neural networks are called deep neural networks. In deep neural network learning (or, simply, deep learning), contrary to shallow learning, most model parameters are learned not directly from the features of the training examples, but from the outputs of the preceding layers.
Don’t worry if you don’t understand what that means right now. We look at neural networks more closely in Chapter 6.
In fact, eq. 3 defines the pdf of one of the most frequently used in practice probability distributions called Gaussian distribution or normal distribution and denoted as $\mathcal{N}(\mu, \sigma^2)$.↩
Multiplication of many numbers can give either a very small result or a very large one. It often results in the problem of numerical overflow when the machine cannot store such extreme numbers in memory.↩
There’s still one label per example though.↩
In this chapter, I describe five algorithms which are not just the most known but also either very effective on their own or are used as building blocks for the most effective learning algorithms out there.
Linear regression is a popular regression learning algorithm that learns a model which is a linear combination of features of the input example.
We have a collection of labeled examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $N$ is the size of the collection, $\mathbf{x}_i$ is the $D$-dimensional feature vector of example $i = 1, \ldots, N$, $y_i$ is a real-valued1 target and every feature $x_i^{(j)}$, $j = 1, \ldots, D$, is also a real number.
We want to build a model $f_{\mathbf{w},b}(\mathbf{x})$ as a linear combination of features of example $\mathbf{x}$:

$$f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w}\mathbf{x} + b, \quad (6)$$

where $\mathbf{w}$ is a $D$-dimensional vector of parameters and $b$ is a real number. The notation $f_{\mathbf{w},b}$ means that the model $f$ is parametrized by two values: $\mathbf{w}$ and $b$.
We will use the model to predict the unknown $y$ for a given $\mathbf{x}$ like this: $y \leftarrow f_{\mathbf{w},b}(\mathbf{x})$. Two models parametrized by two different pairs $(\mathbf{w}, b)$ will likely produce two different predictions when applied to the same example. We want to find the optimal values $(\mathbf{w}^*, b^*)$. Obviously, the optimal values of parameters define the model that makes the most accurate predictions.
You could have noticed that the form of our linear model in eq. 6 is very similar to the form of the SVM model. The only difference is the missing sign operator. The two models are indeed similar. However, the hyperplane in the SVM plays the role of the decision boundary: it's used to separate two groups of examples from one another. As such, it has to be as far from each group as possible.
On the other hand, the hyperplane in linear regression is chosen to be as close to all training examples as possible.
You can see why this latter requirement is essential by looking at the illustration in fig. 7. It displays the regression line (in red) for one-dimensional examples (blue dots). We can use this line to predict the value of the target $y_{\text{new}}$ for a new unlabeled input example $x_{\text{new}}$. If our examples are $D$-dimensional feature vectors (for $D > 1$), the only difference with the one-dimensional case is that the regression model is not a line but a plane (for two dimensions) or a hyperplane (for $D > 2$).
Now you see why it’s essential to have the requirement that the regression hyperplane lies as close to the training examples as possible: if the red line in fig. 7 was far from the blue dots, the prediction would have fewer chances to be correct.
To get this latter requirement satisfied, the optimization procedure which we use to find the optimal values for $\mathbf{w}^*$ and $b^*$ tries to minimize the following expression:

$$\frac{1}{N}\sum_{i=1}^{N} \left(f_{\mathbf{w},b}(\mathbf{x}_i) - y_i\right)^2. \quad (7)$$
In mathematics, the expression we minimize or maximize is called an objective function, or, simply, an objective. The expression $(f_{\mathbf{w},b}(\mathbf{x}_i) - y_i)^2$ in the above objective is called the loss function. It's a measure of penalty for misclassification of example $i$. This particular choice of the loss function is called squared error loss. All model-based learning algorithms have a loss function and what we do to find the best model is we try to minimize the objective known as the cost function. In linear regression, the cost function is given by the average loss, also called the empirical risk. The average loss, or empirical risk, for a model, is the average of all penalties obtained by applying the model to the training data.
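The objective in eq. 7 can be sketched directly (the tiny dataset is illustrative, not from the book):

```python
def predict(x, w, b):
    # Linear model: f(x) = w·x + b.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def empirical_risk(data, w, b):
    # Average squared error loss over the training set (eq. 7).
    return sum((predict(x, w, b) - y) ** 2 for x, y in data) / len(data)

# Illustrative one-dimensional data generated by y = 2x.
data = [([1.0], 2.0), ([2.0], 4.0), ([3.0], 6.0)]
empirical_risk(data, [2.0], 0.0)  # 0.0 — a perfect fit
empirical_risk(data, [1.0], 0.0)  # (1 + 4 + 9) / 3 — a worse model
```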
Why is the loss in linear regression a quadratic function? Why couldn’t we get the absolute value of the difference between the true target and the predicted value and use that as a penalty? We could. Moreover, we also could use a cube instead of a square.
Now you probably start realizing how many seemingly arbitrary decisions are made when we design a machine learning algorithm: we decided to use the linear combination of features to predict the target. However, we could use a square or some other polynomial to combine the values of features. We could also use some other loss function that makes sense: the absolute difference between $y_i$ and $f(\mathbf{x}_i)$ makes sense, the cube of the difference too; the binary loss ($1$ when $f(\mathbf{x}_i)$ and $y_i$ are different and $0$ when they are the same) also makes sense, right?
If we made different decisions about the form of the model, the form of the loss function, and about the choice of the algorithm that minimizes the average loss to find the best values of parameters, we would end up inventing a different machine learning algorithm. Sounds easy, doesn’t it? However, do not rush to invent a new learning algorithm. The fact that it’s different doesn’t mean that it will work better in practice.
People invent new learning algorithms for one of the two main reasons:
One practical justification of the choice of the linear form for the model is that it’s simple. Why use a complex model when you can use a simple one? Another consideration is that linear models rarely overfit. Overfitting is the property of a model such that the model predicts very well labels of the examples used during training but frequently makes errors when applied to examples that weren’t seen by the learning algorithm during training.
An example of overfitting in regression is shown in fig. 8. The data used to build the red regression line is the same as in fig. 7. The difference is that this time, this is the polynomial regression with a polynomial of degree $10$. The regression line predicts almost perfectly the targets of almost all training examples, but will likely make significant errors on new data, as you can see in fig. 7 for $x_{\text{new}}$. We talk more about overfitting and how to avoid it in Chapter 5.
Now you know why linear regression can be useful: it doesn’t overfit much. But what about the squared loss? Why did we decide that it should be squared? In 1805, the French mathematician Adrien-Marie Legendre, who first published the sum of squares method for gauging the quality of the model stated that squaring the error before summing is convenient. Why did he say that? The absolute value is not convenient, because it doesn’t have a continuous derivative, which makes the function not smooth. Functions that are not smooth create unnecessary difficulties when employing linear algebra to find closed form solutions to optimization problems. Closed form solutions to finding an optimum of a function are simple algebraic expressions and are often preferable to using complex numerical optimization methods, such as gradient descent (used, among others, to train neural networks).
Intuitively, squared penalties are also advantageous because they exaggerate the difference between the true target and the predicted one according to the value of this difference. We might also use the powers 3 or 4, but their derivatives are more complicated to work with.
Finally, why do we care about the derivative of the average loss? If we can calculate the gradient of the function in eq. 7, we can then set this gradient to zero2 and find the solution to a system of equations that gives us the optimal values $\mathbf{w}^*$ and $b^*$.
The first thing to say is that logistic regression is not a regression, but a classification learning algorithm. The name comes from statistics and is due to the fact that the mathematical formulation of logistic regression is similar to that of linear regression.
I explain logistic regression on the case of binary classification. However, it can naturally be extended to multiclass classification.
In logistic regression, we still want to model $y_i$ as a linear function of $\mathbf{x}_i$, however, with a binary $y_i$ this is not straightforward. The linear combination of features such as $\mathbf{w}\mathbf{x}_i + b$ is a function that spans from minus infinity to plus infinity, while $y_i$ has only two possible values.
At the time where the absence of computers required scientists to perform manual calculations, they were eager to find a linear classification model. They figured out that if we define a negative label as $0$ and the positive label as $1$, we would just need to find a simple continuous function whose codomain is $(0, 1)$. In such a case, if the value returned by the model for input $\mathbf{x}$ is closer to $0$, then we assign a negative label to $\mathbf{x}$; otherwise, the example is labeled as positive. One function that has such a property is the standard logistic function (also known as the sigmoid function):

$$f(x) = \frac{1}{1 + e^{-x}},$$
where $e$ is the base of the natural logarithm (also called Euler's number; $e^x$ is also known as the exp(x) function in programming languages). Its graph is depicted in fig. 9.
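The standard logistic function is a one-liner in code; a quick sketch:

```python
import math

def sigmoid(x):
    # Standard logistic function: f(x) = 1 / (1 + e^(-x)).
    # Maps the whole real line into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

sigmoid(0.0)  # 0.5, the midpoint of the curve in fig. 9
```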
The logistic regression model looks like this:

$$f_{\mathbf{w},b}(\mathbf{x}) \stackrel{\text{def}}{=} \frac{1}{1 + e^{-(\mathbf{w}\mathbf{x} + b)}}. \quad (8)$$
You can see the familiar term $\mathbf{w}\mathbf{x} + b$ from linear regression.
By looking at the graph of the standard logistic function, we can see how well it fits our classification purpose: if we optimize the values of $\mathbf{w}$ and $b$ appropriately, we could interpret the output of $f$ as the probability of $y_i$ being positive. For example, if it's higher than or equal to the threshold $0.5$ we would say that the class of $\mathbf{x}$ is positive; otherwise, it's negative. In practice, the choice of the threshold could be different depending on the problem. We return to this discussion in Chapter 5 when we talk about model performance assessment.
Now, how do we find optimal $\mathbf{w}^*$ and $b^*$? In linear regression, we minimized the empirical risk which was defined as the average squared error loss, also known as the mean squared error or MSE.
In logistic regression, on the other hand, we maximize the likelihood of our training set according to the model. In statistics, the likelihood function defines how likely the observation (an example) is according to our model.
For instance, let's have a labeled example (x_i, y_i) in our training data. Assume also that we found (guessed) some specific values ŵ and b̂ of our parameters. If we now apply our model f_{ŵ,b̂} to x_i using eq. 8, we will get some value 0 < p < 1 as output. If y_i is the positive class, the likelihood of y_i being the positive class, according to our model, is given by p. Similarly, if y_i is the negative class, the likelihood of it being the negative class is given by 1 − p.
The optimization criterion in logistic regression is called maximum likelihood. Instead of minimizing the average loss, like in linear regression, we now maximize the likelihood of the training data according to our model:

L_{w,b} := ∏_{i=1..N} f_{w,b}(x_i)^{y_i} (1 − f_{w,b}(x_i))^{(1 − y_i)}.
The expression f_{w,b}(x_i)^{y_i} (1 − f_{w,b}(x_i))^{(1 − y_i)} may look scary, but it's just a fancy mathematical way of saying: "f_{w,b}(x_i) when y_i = 1, and (1 − f_{w,b}(x_i)) otherwise". Indeed, if y_i = 1, then (1 − f_{w,b}(x_i))^{(1 − y_i)} equals 1 because (1 − y_i) = 0, and we know that anything in the power of 0 equals 1. On the other hand, if y_i = 0, then f_{w,b}(x_i)^{y_i} equals 1 for the same reason.
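This case analysis is easy to check numerically. The sketch below (the function name is mine) evaluates the expression for both possible labels and confirms that it reduces to p when y = 1 and to 1 − p when y = 0:

```python
def likelihood(p, y):
    # p is the model output f_wb(x), y is the true label in {0, 1};
    # one of the two factors is always raised to the power 0
    return (p ** y) * ((1 - p) ** (1 - y))

p = 0.8
print(likelihood(p, 1))  # equals p
print(likelihood(p, 0))  # equals 1 - p (up to float rounding)
```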
You may have noticed that we used the product operator ∏ in the objective function instead of the sum operator Σ, which was used in linear regression. It's because the likelihood of observing N labels for N examples is the product of the likelihoods of each observation (assuming that all observations are independent of one another, which is the case). You can draw a parallel with the multiplication of probabilities of outcomes in a series of independent experiments in probability theory.
Because of the exp function used in the model, in practice it's more convenient to maximize the log-likelihood instead of the likelihood, to avoid numerical overflow. The log-likelihood is defined as follows:

LogL_{w,b} := ln(L_{w,b}) = Σ_{i=1..N} [y_i ln f_{w,b}(x_i) + (1 − y_i) ln(1 − f_{w,b}(x_i))].
Because ln is a strictly increasing function, maximizing this function is the same as maximizing its argument, and the solution to this new optimization problem is the same as the solution to the original problem.
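A quick numeric check (function names and toy values are mine) that the log-likelihood is the logarithm of the likelihood product, so maximizing one maximizes the other:

```python
import math

def product_likelihood(ps, ys):
    # product of per-example likelihoods f^y * (1-f)^(1-y)
    result = 1.0
    for p, y in zip(ps, ys):
        result *= (p ** y) * ((1 - p) ** (1 - y))
    return result

def log_likelihood(ps, ys):
    # sum of y*ln(f) + (1-y)*ln(1-f)
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for p, y in zip(ps, ys))

ps = [0.9, 0.2, 0.7]   # model outputs for three examples
ys = [1, 0, 1]         # their true labels
print(math.log(product_likelihood(ps, ys)))  # same value as the next line
print(log_likelihood(ps, ys))
```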
Contrary to linear regression, there’s no closed form solution to the above optimization problem. A typical numerical optimization procedure used in such cases is gradient descent. We talk about it in the next chapter.
A decision tree is an acyclic graph that can be used to make decisions. In each branching node of the graph, a specific feature j of the feature vector is examined. If the value of the feature is below a specific threshold, then the left branch is followed; otherwise, the right branch is followed. As the leaf node is reached, the decision is made about the class to which the example belongs.
As the title of the section suggests, a decision tree can be learned from data.
Like previously, we have a collection of labeled examples; labels belong to the set {0, 1}. We want to build a decision tree that would allow us to predict the class given a feature vector.
There are various formulations of the decision tree learning algorithm. In this book, we consider just one, called ID3.
The optimization criterion, in this case, is the average log-likelihood:

(1/N) Σ_{i=1..N} [y_i ln f_{ID3}(x_i) + (1 − y_i) ln(1 − f_{ID3}(x_i))],
where f_{ID3} is a decision tree.
By now, it looks very similar to logistic regression. However, contrary to the logistic regression learning algorithm, which builds a parametric model f_{w*,b*} by finding an optimal solution to the optimization criterion, the ID3 algorithm optimizes it approximately by constructing a nonparametric model f_{ID3}(x) := Pr(y = 1 | x).
The ID3 learning algorithm works as follows. Let S denote a set of labeled examples. In the beginning, the decision tree only has a start node that contains all examples: S := {(x_i, y_i)}_{i=1..N}. Start with a constant model f^S_{ID3} defined as,

f^S_{ID3} := (1/|S|) Σ_{(x,y)∈S} y.
The prediction given by the above model, f^S_{ID3}(x), would be the same for any input x. The corresponding decision tree built using a toy dataset of labeled examples is shown in fig. 10.
Then we search through all features j = 1, …, D and all thresholds t, and split the set S into two subsets: S− := {(x, y) | (x, y) ∈ S, x^(j) < t} and S+ := {(x, y) | (x, y) ∈ S, x^(j) ≥ t}. The two new subsets would go to two new leaf nodes, and we evaluate, for all possible pairs (j, t), how good the split with pieces S− and S+ is. Finally, we pick the best such values (j, t), split S into S+ and S−, form two new leaf nodes, and continue recursively on S+ and S− (or quit if no split produces a model that's sufficiently better than the current one). A decision tree after one split is illustrated in fig. 11.
Now you should wonder what the words "evaluate how good the split is" mean. In ID3, the goodness of a split is estimated by using the criterion called entropy. Entropy is a measure of uncertainty about a random variable. It reaches its maximum when all values of the random variable are equiprobable. Entropy reaches its minimum when the random variable can have only one value. The entropy of a set of examples S is given by,

H(S) := −f^S_{ID3} ln f^S_{ID3} − (1 − f^S_{ID3}) ln(1 − f^S_{ID3}).
When we split a set of examples by a certain feature j and a threshold t, the entropy of a split, H(S−, S+), is simply a weighted sum of two entropies:

H(S−, S+) := (|S−|/|S|) H(S−) + (|S+|/|S|) H(S+).   (eq. 13)
So, in ID3, at each step, at each leaf node, we find a split that minimizes the entropy given by eq. 13 or we stop at this leaf node.
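The entropy computation and the split evaluation described above can be sketched in a few lines of Python, assuming binary labels (all names are mine; a real implementation would also search over features and thresholds):

```python
import math

def entropy(labels):
    # H(S) = -p*ln(p) - (1-p)*ln(1-p), where p is the share of 1-labels
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p == 0.0 or p == 1.0:
        return 0.0  # a pure node has zero uncertainty
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def split_entropy(s_minus, s_plus):
    # weighted sum of the entropies of the two subsets
    n = len(s_minus) + len(s_plus)
    return (len(s_minus) / n) * entropy(s_minus) + \
           (len(s_plus) / n) * entropy(s_plus)

# a pure split has lower entropy than a perfectly mixed one
print(split_entropy([0, 0], [1, 1]))  # 0.0
print(split_entropy([0, 1], [0, 1]))  # ln(2), about 0.693
```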
The algorithm stops at a leaf node in any of the below situations:

- all examples in the leaf node are classified correctly by the one-piece model f^S_{ID3};
- we cannot find an attribute to split upon;
- the split reduces the entropy by less than some ε (a hyperparameter whose value has to be found experimentally);
- the tree reaches some maximum depth d (also a hyperparameter).
Because in ID3, the decision to split the dataset on each iteration is local (doesn’t depend on future splits), the algorithm doesn’t guarantee an optimal solution. The model can be improved by using techniques like backtracking during the search for the optimal decision tree at the cost of possibly taking longer to build a model.
The most widely used formulation of a decision tree learning algorithm is called C4.5. It has several additional features as compared to ID3:

- it accepts both continuous and discrete features;
- it handles incomplete examples;
- it solves the overfitting problem by using a bottom-up technique known as "pruning".
Pruning consists of going back through the tree once it’s been created and removing branches that don’t contribute significantly enough to the error reduction by replacing them with leaf nodes.
The entropy-based split criterion intuitively makes sense: entropy reaches its minimum of 0 when all examples in S have the same label; on the other hand, entropy is at its maximum when exactly one-half of the examples in S is labeled with 1, making such a leaf useless for classification. The only remaining question is how this algorithm approximately maximizes the average log-likelihood criterion. I leave it for further reading.
I already presented SVM in the introduction, so this section only fills a couple of blanks. Two critical questions need to be answered:

1. What if there's noise in the data and no hyperplane can perfectly separate positive examples from negative ones?
2. What if the data cannot be separated using a plane, but could be separated by a higher-order polynomial?
You can see both situations depicted in fig. 12 and fig. 13. In the left case, the data could be separated by a straight line if not for the noise (outliers or examples with wrong labels). In the right case, the decision boundary is a circle and not a straight line.
Remember that in SVM, we want to satisfy the following constraints:

wx_i − b ≥ +1 if y_i = +1,
wx_i − b ≤ −1 if y_i = −1.
We also want to minimize ‖w‖ so that the hyperplane is equally distant from the closest examples of each class. Minimizing ‖w‖ is equivalent to minimizing (1/2)‖w‖², and the use of this term makes it possible to perform quadratic programming optimization later on. The optimization problem for SVM, therefore, looks like this:

min (1/2)‖w‖², such that y_i(wx_i − b) − 1 ≥ 0 for i = 1, …, N.   (eq. 15)
To extend SVM to cases in which the data is not linearly separable, we introduce the hinge loss function: max(0, 1 − y_i(wx_i − b)).
The hinge loss function is zero if the constraints above are satisfied; in other words, if x_i lies on the correct side of the decision boundary. For data on the wrong side of the decision boundary, the function's value is proportional to the distance from the decision boundary.
We then wish to minimize the following cost function,

C‖w‖² + (1/N) Σ_{i=1..N} max(0, 1 − y_i(wx_i − b)),
where the hyperparameter C determines the tradeoff between increasing the size of the decision boundary and ensuring that each x_i lies on the correct side of the decision boundary. The value of C is usually chosen experimentally, just like ID3's hyperparameters ε and d. SVMs that optimize hinge loss are called soft-margin SVMs, while the original formulation is referred to as a hard-margin SVM.
As you can see, for sufficiently high values of C, the second term in the cost function will become negligible, so the SVM algorithm will try to find the widest margin by completely ignoring misclassification. As we decrease the value of C, making classification errors becomes more costly, so the SVM algorithm tries to make fewer mistakes by sacrificing the margin size. As we have already discussed, a larger margin is better for generalization. Therefore, C regulates the tradeoff between classifying the training data well (minimizing empirical risk) and classifying future examples well (generalization).
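A tiny numeric sketch of the hinge loss and the soft-margin objective for one-dimensional inputs (all names and toy values are mine):

```python
def hinge_loss(w, b, x, y):
    # max(0, 1 - y*(w*x - b)) for a 1D example; zero when the example
    # is on the correct side of the boundary with enough margin
    return max(0.0, 1.0 - y * (w * x - b))

def svm_cost(w, b, data, C):
    # C*||w||^2 + average hinge loss (1D sketch of the soft-margin objective)
    avg_hinge = sum(hinge_loss(w, b, x, y) for x, y in data) / len(data)
    return C * w * w + avg_hinge

data = [(2.0, 1), (-2.0, -1)]          # labels are in {-1, +1}
print(hinge_loss(1.0, 0.0, 2.0, 1))    # 0.0: correct side, outside the margin
print(hinge_loss(1.0, 0.0, 0.5, 1))    # 0.5: correct side, but inside the margin
print(svm_cost(1.0, 0.0, data, 0.01))  # only the C*||w||^2 term remains
```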
SVM can be adapted to work with datasets that cannot be separated by a hyperplane in its original space. Indeed, if we manage to transform the original space into a space of higher dimensionality, we could hope that the examples will become linearly separable in this transformed space. In SVMs, using a function to implicitly transform the original space into a higher dimensional space during the cost function optimization is called the kernel trick.
The effect of applying the kernel trick is illustrated in fig. 14. As you can see, it's possible to transform two-dimensional non-linearly-separable data into linearly-separable three-dimensional data using a specific mapping φ: x → φ(x), where φ(x) is a vector of higher dimensionality than x. For the example of the 2D data in fig. 13, the mapping φ that projects a 2D example x = [q, p] into a 3D space (fig. 14) would look like this: φ([q, p]) := (q², √2·qp, p²), where ·² means · squared. You see now that the data becomes linearly separable in the transformed space.
However, we don’t know a priori which mapping would work for our data. If we first transform all our input examples using some mapping into very high dimensional vectors and then apply SVM to this data, and we try all possible mapping functions, the computation could become very inefficient, and we would never solve our classification problem.
Fortunately, scientists figured out how to use kernel functions (or, simply, kernels) to efficiently work in higher-dimensional spaces without doing this transformation explicitly. To understand how kernels work, we have to see first how the optimization algorithm for SVM finds the optimal values for w and b.
The method traditionally used to solve the optimization problem in eq. 15 is the method of Lagrange multipliers. Instead of solving the original problem from eq. 15, it is convenient to solve an equivalent problem formulated like this:

max_{α_1, …, α_N} Σ_{i=1..N} α_i − (1/2) Σ_{i=1..N} Σ_{k=1..N} y_i α_i (x_i · x_k) y_k α_k,
subject to Σ_{i=1..N} α_i y_i = 0 and α_i ≥ 0 for i = 1, …, N,
where α_i are called Lagrange multipliers. When formulated like this, the optimization problem becomes a convex quadratic optimization problem, efficiently solvable by quadratic programming algorithms.
Now, you could have noticed that in the above formulation there is a term x_i · x_k, and this is the only place where the feature vectors are used. If we want to transform our vector space into a higher-dimensional space, we need to transform x_i into φ(x_i) and x_k into φ(x_k) and then multiply φ(x_i) and φ(x_k). Doing so would be very costly.
On the other hand, we are only interested in the result of the dot-product x_i · x_k, which, as we know, is a real number. We don't care how this number was obtained as long as it's correct. By using the kernel trick, we can get rid of a costly transformation of original feature vectors into higher-dimensional vectors and avoid computing their dot-product. We replace that by a simple operation on the original feature vectors that gives the same result. For example, instead of transforming (q_1, p_1) into (q_1², √2·q_1 p_1, p_1²) and (q_2, p_2) into (q_2², √2·q_2 p_2, p_2²) and then computing the dot-product of these two 3D vectors to obtain (q_1² q_2² + 2 q_1 q_2 p_1 p_2 + p_1² p_2²), we could find the dot-product between (q_1, p_1) and (q_2, p_2) to get (q_1 q_2 + p_1 p_2) and then square it to get exactly the same result.
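This equality is easy to verify numerically. In the sketch below (names and toy values are mine), phi is the explicit mapping (q, p) → (q², √2·qp, p²); the dot-product in the 3D space matches the squared dot-product in the original 2D space:

```python
import math

def phi(v):
    # explicit mapping (q, p) -> (q^2, sqrt(2)*q*p, p^2)
    q, p = v
    return (q * q, math.sqrt(2) * q * p, p * p)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x1, x2 = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x1), phi(x2))  # dot-product computed in the 3D space
kernel = dot(x1, x2) ** 2         # quadratic kernel (x1 . x2)^2, no mapping needed
print(explicit, kernel)           # both are 121 (up to float rounding)
```

The kernel side needs only a 2D dot-product and one squaring, while the explicit side requires building the higher-dimensional vectors first; the gap grows with the dimensionality of the mapping.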
That was an example of the kernel trick, and we used the quadratic kernel k(x_i, x_k) := (x_i · x_k)². Multiple kernel functions exist, the most widely used of which is the RBF kernel:

k(x, x′) = exp(−‖x − x′‖² / (2σ²)),
where ‖x − x′‖² is the squared Euclidean distance between two feature vectors. The Euclidean distance is given by the following equation:

d(x_i, x_k) := sqrt( (x_i^(1) − x_k^(1))² + ⋯ + (x_i^(D) − x_k^(D))² ) = sqrt( Σ_{j=1..D} (x_i^(j) − x_k^(j))² ).
It can be shown that the feature space of the RBF (for "radial basis function") kernel has an infinite number of dimensions. By varying the hyperparameter σ, the data analyst can choose between getting a smooth or a curvy decision boundary in the original space.
k-Nearest Neighbors (kNN) is a non-parametric learning algorithm. Contrary to other learning algorithms that allow discarding the training data after the model is built, kNN keeps all training examples in memory. Once a new, previously unseen example x comes in, the kNN algorithm finds k training examples closest to x and returns the majority label, in case of classification, or the average label, in case of regression.
The closeness of two examples is given by a distance function. For example, the Euclidean distance seen above is frequently used in practice. Another popular choice of the distance function is the negative cosine similarity. Cosine similarity, defined as,

s(x_i, x_k) := cos(∠(x_i, x_k)) = Σ_{j=1..D} x_i^(j) x_k^(j) / ( sqrt(Σ_{j=1..D} (x_i^(j))²) · sqrt(Σ_{j=1..D} (x_k^(j))²) ),
is a measure of the similarity of the directions of two vectors. If the angle between two vectors is 0 degrees, then the two vectors point in the same direction, and the cosine similarity is equal to 1. If the vectors are orthogonal, the cosine similarity is 0. For vectors pointing in opposite directions, the cosine similarity is −1. If we want to use cosine similarity as a distance metric, we need to multiply it by −1. Other popular distance metrics include Chebyshev distance, Mahalanobis distance, and Hamming distance. The choice of the distance metric, as well as the value for k, are the choices the analyst makes before running the algorithm. So these are hyperparameters. The distance metric could also be learned from data (as opposed to guessing it). We talk about that in Chapter 10.
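A minimal kNN classifier over the ideas above, using Euclidean distance and a majority vote (all names and toy data are mine; there is no tie-breaking and no attention to efficiency):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, x, k):
    # train is a list of (feature_vector, label) pairs;
    # take the k closest examples and return the majority label
    neighbors = sorted(train, key=lambda ex: euclidean(ex[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((5.0, 5.0), 1), ((5.1, 4.9), 1)]
print(knn_predict(train, (0.2, 0.1), 3))  # 0: the near cluster wins the vote
print(knn_predict(train, (5.0, 4.8), 3))  # 1: the far cluster wins the vote
```

For regression, the majority vote would be replaced by the average of the neighbors' labels.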
To say that y is real-valued, we write y ∈ ℝ, where ℝ denotes the set of all real numbers, an infinite set of numbers from minus infinity to plus infinity.↩
To find the minimum or the maximum of a function, we set the gradient to zero because the value of the gradient at the extrema of a function is always zero. In 2D, the tangent line at an extremum of a function is horizontal.↩
In Chapter 5, I show how to do that in the section on hyperparameter tuning.↩
You may have noticed by reading the previous chapter that each learning algorithm we saw consisted of three parts:

1. a loss function;
2. an optimization criterion based on the loss function (a cost function, for example); and
3. an optimization routine that leverages training data to find a solution to the optimization criterion.
These are the building blocks of any learning algorithm. You saw in the previous chapter that some algorithms were designed to explicitly optimize a specific criterion (both linear and logistic regressions, and SVM). Some others, including decision tree learning and kNN, optimize the criterion implicitly. Decision tree learning and kNN are among the oldest machine learning algorithms and were invented experimentally based on intuition, without a specific global optimization criterion in mind; as has often happened in the history of science, the optimization criteria were developed later to explain why those algorithms work.
Reading the modern literature on machine learning, you often encounter references to gradient descent or stochastic gradient descent. These are the two most frequently used optimization algorithms in cases where the optimization criterion is differentiable.
Gradient descent is an iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one starts at some random point and takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.
Gradient descent can be used to find optimal parameters for linear and logistic regression, SVM and also neural networks which we consider later. For many models, such as logistic regression or SVM, the optimization criterion is convex. Convex functions have only one minimum, which is global. Optimization criteria for neural networks are not convex, but in practice even finding a local minimum suffices.
Let’s see how gradient descent works.
In this section, I demonstrate how gradient descent finds the solution to a linear regression problem1. I illustrate my description with Python code as well as with plots that show how the solution improves after some iterations of gradient descent. I use a dataset with only one feature. However, the optimization criterion will have two parameters: w and b. The extension to multi-dimensional training data is straightforward: you have the variables w^(1), w^(2), and b for two-dimensional data; w^(1), w^(2), w^(3), and b for three-dimensional data, and so on.
To give a practical example, I use the real dataset (can be found on the book’s wiki) with the following columns: the Spendings of various companies on radio advertising each year and their annual Sales in terms of units sold. We want to build a regression model that we can use to predict units sold based on how much a company spends on radio advertising. Each row in the dataset represents one specific company:
| Company | Spendings, M$ | Sales, Units |
|---|---|---|
| 1 | 37.8 | 22.1 |
| 2 | 39.3 | 10.4 |
| 3 | 45.9 | 9.3 |
| 4 | 41.3 | 18.5 |
| .. | .. | .. |
We have data for 200 companies, so we have 200 training examples in the form (x_i, y_i). In fig. 15, you can see all examples on a 2D plot.
Remember that the linear regression model looks like this: f(x) = wx + b. We don't know what the optimal values for w and b are, and we want to learn them from data. To do that, we look for such values for w and b that minimize the mean squared error:

l := (1/N) Σ_{i=1..N} (y_i − (wx_i + b))².
Gradient descent starts with calculating the partial derivative for every parameter:

∂l/∂w = (1/N) Σ_{i=1..N} −2x_i(y_i − (wx_i + b));
∂l/∂b = (1/N) Σ_{i=1..N} −2(y_i − (wx_i + b)).
To find the partial derivative of the term (y_i − (wx_i + b))² with respect to w, I applied the chain rule. Here, we have the chain f = f_2(f_1), where f_1 = y_i − (wx_i + b) and f_2 = f_1². To find the partial derivative of f with respect to w, we first take the derivative of f_2 with respect to f_1, which equals 2(y_i − (wx_i + b)) (from calculus, we know that the derivative ∂f² /∂f = 2f), and then multiply it by the partial derivative of f_1 with respect to w, which equals −x_i. So overall ∂l/∂w = (1/N) Σ_{i=1..N} −2x_i(y_i − (wx_i + b)). In a similar way, the partial derivative ∂l/∂b was calculated.
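The chain-rule result can be sanity-checked against finite differences. The sketch below (names and toy values are mine) compares the analytic derivative of the squared error for one example with a numerical approximation:

```python
def loss(w, b, x, y):
    # squared error for a single example
    return (y - (w * x + b)) ** 2

def dloss_dw(w, b, x, y):
    # chain rule: 2*(y - (w*x + b)) * (-x)
    return -2 * x * (y - (w * x + b))

w, b, x, y = 0.5, 0.1, 2.0, 3.0
eps = 1e-6
# central finite difference approximates the derivative numerically
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
print(dloss_dw(w, b, x, y), numeric)  # nearly identical values
```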
Gradient descent proceeds in epochs. An epoch consists of using the training set entirely to update each parameter. In the beginning, the first epoch, we initialize2 w ← 0 and b ← 0. The partial derivatives ∂l/∂w and ∂l/∂b given by the above equations then equal, respectively, (−2/N) Σ_{i=1..N} x_i y_i and (−2/N) Σ_{i=1..N} y_i. At each epoch, we update w and b using the partial derivatives. The learning rate α controls the size of an update:

w ← w − α (∂l/∂w);
b ← b − α (∂l/∂b).
We subtract (as opposed to adding) partial derivatives from the values of parameters because derivatives are indicators of growth of a function. If a derivative is positive at some point3, then the function grows at this point. Because we want to minimize the objective function, when the derivative is positive we know that we need to move our parameter in the opposite direction (to the left on the axis of coordinates). When the derivative is negative (function is decreasing), we need to move our parameter to the right to decrease the value of the function even more. Subtracting a negative value from a parameter moves it to the right.
At the next epoch, we recalculate the partial derivatives using the updated values of w and b; we continue the process until convergence. Typically, we need many epochs until we start seeing that the values for w and b don't change much after each epoch; then we stop.
It’s hard to imagine a machine learning engineer who doesn’t use Python. So, if you waited for the right moment to start learning Python, this is that moment. Below, I show how to program gradient descent in Python.
The function that updates the parameters w and b during one epoch is shown below:
def update_w_and_b(spendings, sales, w, b, alpha):
    dl_dw = 0.0
    dl_db = 0.0
    N = len(spendings)
    for i in range(N):
        dl_dw += -2*spendings[i]*(sales[i] - (w*spendings[i] + b))
        dl_db += -2*(sales[i] - (w*spendings[i] + b))
    # update w and b
    w = w - (1/float(N))*dl_dw*alpha
    b = b - (1/float(N))*dl_db*alpha
    return w, b
The function that loops over multiple epochs is shown below:
def train(spendings, sales, w, b, alpha, epochs):
    for e in range(epochs):
        w, b = update_w_and_b(spendings, sales, w, b, alpha)
        # log the progress
        if e % 400 == 0:
            print("epoch:", e, "loss: ", avg_loss(spendings, sales, w, b))
    return w, b
The function avg_loss in the above code snippet is a function that computes the mean squared error. It is defined as:
def avg_loss(spendings, sales, w, b):
    N = len(spendings)
    total_error = 0.0
    for i in range(N):
        total_error += (sales[i] - (w*spendings[i] + b))**2
    return total_error / float(N)
If we run the train function with w = 0.0, b = 0.0, α = 0.001, and 15,000 epochs, we will see the following output (shown partially):
epoch: 0 loss: 92.32078294903626
epoch: 400 loss: 33.79131790081576
epoch: 800 loss: 27.9918542960729
epoch: 1200 loss: 24.33481690722147
epoch: 1600 loss: 22.028754937538633
...
epoch: 2800 loss: 19.07940244306619
You can see that the average loss decreases as the train function loops through epochs. In fig. 16 you can see the evolution of the regression line through epochs.
Finally, once we have found the optimal values of the parameters w* and b*, the only missing piece is a function that makes predictions:

def predict(x, w, b):
    return w*x + b
Try to execute the following code:
w, b = train(x, y, 0.0, 0.0, 0.001, 15000)
x_new = 23.0
y_new = predict(x_new, w, b)
print(y_new)
The output is .
Gradient descent is sensitive to the choice of the learning rate . It is also slow for large datasets. Fortunately, several significant improvements to this algorithm have been proposed.
Minibatch stochastic gradient descent (minibatch SGD) is a version of the algorithm that speeds up the computation by approximating the gradient using smaller batches (subsets) of the training data. SGD itself has various "upgrades". Adagrad is a version of SGD that scales the learning rate α individually for each parameter according to the history of gradients. As a result, α is reduced for parameters with consistently large gradients and vice-versa. Momentum is a method that helps accelerate SGD by orienting the gradient descent in the relevant direction and reducing oscillations. In neural network training, variants of SGD such as RMSprop and Adam are used very frequently.
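To make the minibatch idea concrete, here is a sketch of minibatch SGD for the same one-feature linear regression (the function name and the toy data are mine; real implementations also shuffle the data between epochs):

```python
def sgd_epoch(xs, ys, w, b, alpha, batch_size):
    # one pass over the data, updating the parameters after every minibatch
    for start in range(0, len(xs), batch_size):
        xb = xs[start:start + batch_size]
        yb = ys[start:start + batch_size]
        n = len(xb)
        # the MSE gradient is estimated on the minibatch only
        dl_dw = sum(-2 * x * (y - (w * x + b)) for x, y in zip(xb, yb)) / n
        dl_db = sum(-2 * (y - (w * x + b)) for x, y in zip(xb, yb)) / n
        w -= alpha * dl_dw
        b -= alpha * dl_db
    return w, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # the true relation is y = 2x
w, b = 0.0, 0.0
for _ in range(2000):
    w, b = sgd_epoch(xs, ys, w, b, 0.01, 2)
print(w, b)  # values close to 2.0 and 0.0
```

With a batch size equal to the dataset size, this reduces to ordinary (full-batch) gradient descent; with a batch size of 1, it becomes plain SGD.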
Notice that gradient descent and its variants are not machine learning algorithms. They are solvers of minimization problems in which the function to minimize has a gradient (in most points of its domain).
Unless you are a research scientist or work for a huge corporation with a large R&D budget, you usually don't implement machine learning algorithms yourself. You don't implement gradient descent or some other solver either. You use libraries, most of which are open source. A library is a collection of algorithms and supporting tools implemented with stability and efficiency in mind. The most frequently used open-source machine learning library in practice is scikit-learn. It's written in Python and C. Here's how you do linear regression in scikit-learn:
def train(x, y):
    from sklearn.linear_model import LinearRegression
    model = LinearRegression().fit(x, y)
    return model

model = train(x, y)
x_new = 23.0
y_new = model.predict([[x_new]])  # scikit-learn expects a 2D array of features
print(y_new)
The output will, again, be . Easy, right? You can replace LinearRegression with some other type of regression learning algorithm without modifying anything else. It just works. The same can be said about classification. You can easily replace LogisticRegression algorithm with SVC algorithm (this is scikit-learn’s name for the Support Vector Machine algorithm), DecisionTreeClassifier, NearestNeighbors or many other classification learning algorithms implemented in scikit-learn.
Here, I outline some practical particularities that can differentiate one learning algorithm from another. You already know that different learning algorithms can have different hyperparameters (C in SVM, ε and d in ID3). Solvers such as gradient descent can also have hyperparameters, like α, for example.
Some algorithms, like decision tree learning, can accept categorical features. For example, if you have a feature “color” that can take values “red”, “yellow”, or “green”, you can keep this feature as is. SVM, logistic and linear regression, as well as kNN (with cosine similarity or Euclidean distance metrics), expect numerical values for all features. All algorithms implemented in scikit-learn expect numerical features. In the next chapter, I show how to convert categorical features into numerical ones.
Some algorithms, like SVM, allow the data analyst to provide weightings for each class. These weightings influence how the decision boundary is drawn. If the weight of some class is high, the learning algorithm tries to not make errors in predicting training examples of this class (typically, for the cost of making an error elsewhere). That could be important if instances of some class are in the minority in your training data, but you would like to avoid misclassifying examples of that class as much as possible.
Some classification models, like SVM and kNN, given a feature vector, only output the class. Others, like logistic regression or decision trees, can also return a score between 0 and 1, which can be interpreted either as how confident the model is about the prediction or as the probability that the input example belongs to a certain class4.
Some classification algorithms (like decision tree learning, logistic regression, or SVM) build the model using the whole dataset at once. If you get additional labeled examples, you have to rebuild the model from scratch. Other algorithms (such as Naïve Bayes, multilayer perceptron, SGDClassifier/SGDRegressor, PassiveAggressiveClassifier/PassiveAggressiveRegressor in scikit-learn) can be trained iteratively, one batch at a time. Once new training examples are available, you can update the model using only the new data.
Finally, some algorithms, like decision tree learning, SVM, and kNN can be used for both classification and regression, while others can only solve one type of problem: either classification or regression, but not both.
Usually, each library provides documentation that explains what kind of problem each algorithm solves, what input values are allowed, and what kind of output the model produces. The documentation also provides information on hyperparameters.
As you know, linear regression has a closed form solution. That means that gradient descent is not needed to solve this specific type of problem. However, for illustration purposes, linear regression is a perfect problem to explain gradient descent.↩
In complex models, such as neural networks, which have thousands of parameters, the initialization of parameters may significantly affect the solution found using gradient descent. There are different initialization methods (at random, with all zeroes, with small values around zero, and others) and it is an important choice the data analyst has to make.↩
A point is given by the current values of parameters.↩
If it’s really necessary, the score for SVM and kNN predictions could be synthetically created using simple techniques.↩
Until now, I only mentioned in passing some issues that a data analyst needs to consider when working on a machine learning problem: feature engineering, overfitting, and hyperparameter tuning. In this chapter, we talk about these and other challenges that have to be addressed before you can train a model with a few lines of scikit-learn code.
When a product manager tells you, “We need to be able to predict whether a particular customer will stay with us. Here are the logs of customers’ interactions with our product for five years,” you cannot just grab the data, load it into a library, and get a prediction. You need to build a dataset first.
Remember from the first chapter that a dataset is a collection of labeled examples {(x_i, y_i)}_{i=1}^N. Each element x_i among the N examples is called a feature vector. A feature vector is a vector in which each dimension j contains a value that describes the example somehow. That value is called a feature and is denoted as x^(j).
The problem of transforming raw data into a dataset is called feature engineering. For most practical problems, feature engineering is a labor-intensive process that demands a lot of creativity from the data analyst and, preferably, domain knowledge.
For example, to transform the logs of user interaction with a computer system, one could create features that contain information about the user and various statistics extracted from the logs. For each user, one feature would contain the price of the subscription; other features would contain the frequency of connections per day, week and year. Another feature would contain the average session duration in seconds or the average response time for one request, and so on. Everything measurable can be used as a feature. The role of the data analyst is to create informative features: those would allow the learning algorithm to build a model that does a good job of predicting labels of the data used for training. Highly informative features are also called features with high predictive power. For example, the average duration of a user’s session has high predictive power for the problem of predicting whether the user will keep using the application in the future.
We say that a model has a low bias when it predicts the training data well. That is, the model makes few mistakes when we use it to predict labels of the examples used to build the model.
Some learning algorithms only work with numerical feature vectors. When some feature in your dataset is categorical, like “colors” or “days of the week,” you can transform such a categorical feature into several binary ones.
If your example has a categorical feature “colors” and this feature has three possible values: “red,” “yellow,” “green,” you can transform this feature into a vector of three numerical values:

red = [1, 0, 0]
yellow = [0, 1, 0]
green = [0, 0, 1]
By doing so, you increase the dimensionality of your feature vectors. You should not transform red into 1, yellow into 2, and green into 3 to avoid increasing the dimensionality, because that would imply that there’s an order among the values in this category and that this specific order is important for the decision making. If the order of a feature’s values is not important, using ordered numbers as values is likely to confuse the learning algorithm,1 because the algorithm will try to find a regularity where there is none, which may potentially lead to overfitting.
An opposite situation, occurring less frequently in practice, is when you have a numerical feature but want to convert it into a categorical one. Binning (also called bucketing) is the process of converting a continuous feature into multiple binary features called bins or buckets, typically based on value range. For example, instead of representing age as a single real-valued feature, the analyst could chop ranges of age into discrete bins: all ages between 0 and 5 years old could be put into one bin, 6 to 10 years old could be in the second bin, 11 to 15 years old could be in the third bin, and so on.
For example, let feature j represent age. By applying binning, we replace this feature with the corresponding bins. Let the three new bins, “age_bin1”, “age_bin2” and “age_bin3”, be added with indexes j, j + 1 and j + 2 respectively. Now if x_i^(j) = 7 for some example x_i, then we set feature x_i^(j+1) to 1; if x_i^(j) = 13, then we set feature x_i^(j+2) to 1, and so on.
In some cases, a carefully designed binning can help the learning algorithm to learn using fewer examples. It happens because we give a “hint” to the learning algorithm that if the value of a feature falls within a specific range, the exact value of the feature doesn’t matter.
Normalization is the process of converting an actual range of values which a numerical feature can take into a standard range of values, typically in the interval [−1, 1] or [0, 1].
For example, suppose the natural range of a particular feature is 350 to 1450. By subtracting 350 from every value of the feature and dividing the result by 1100, one can normalize those values into the range [0, 1].
More generally, the normalization formula looks like this:

x̄^(j) = (x^(j) − min^(j)) / (max^(j) − min^(j)),

where min^(j) and max^(j) are, respectively, the minimum and the maximum value of feature j in the dataset.
Why do we normalize? Normalizing the data is not a strict requirement. However, in practice, it can lead to an increased speed of learning. Remember the gradient descent example from the previous chapter. Imagine you have a two-dimensional feature vector. When you update the parameters w^(1) and w^(2), you use partial derivatives of the mean squared error with respect to w^(1) and w^(2). If x^(1) is in the range [0, 1000] and x^(2) in the range [0, 0.0001], then the derivative with respect to the larger feature will dominate the update.
Additionally, it’s useful to ensure that our inputs are roughly in the same relatively small range to avoid problems which computers have when working with very small or very big numbers (known as numerical overflow).
Standardization (or z-score normalization) is the procedure during which the feature values are rescaled so that they have the properties of a standard normal distribution with μ = 0 and σ = 1, where μ is the mean (the average value of the feature, averaged over all examples in the dataset) and σ is the standard deviation from the mean.
Standard scores (or z-scores) of features are calculated as follows:

x̂^(j) = (x^(j) − μ^(j)) / σ^(j).
You may ask when you should use normalization and when standardization. There’s no definitive answer to this question. Usually, if your dataset is not too big and you have time, you can try both and see which one performs better for your task.
If you don’t have time to run multiple experiments, as a rule of thumb:

- unsupervised learning algorithms, in practice, more often benefit from standardization than from normalization;
- standardization is preferred for a feature if the values it takes are distributed close to a normal distribution (the so-called bell curve);
- standardization is also preferred for a feature that can sometimes have extremely high or low values (outliers), because normalization would “squeeze” the regular values into a very small range;
- in all other cases, normalization is preferable.
Feature rescaling is usually beneficial to most learning algorithms. However, modern implementations of the learning algorithms, which you can find in popular libraries, are robust to features lying in different ranges.
In some cases, the data comes to the analyst in the form of a dataset with features already defined. In some examples, values of some features can be missing. That often happens when the dataset was handcrafted, and the person working on it forgot to fill some values or didn’t get them measured at all.
The typical approaches of dealing with missing values for a feature include:

- removing the examples with missing features from the dataset (if your dataset is big enough to sacrifice some training examples);
- using a learning algorithm that can deal with missing feature values (this depends on the library and the specific implementation of the algorithm);
- using a data imputation technique.
One data imputation technique consists in replacing the missing value of a feature by the average value of this feature in the dataset.
Another technique is to replace the missing value with a value outside the normal range of values. For example, if the normal range is [0, 1], then you can set the missing value to 2 or −1. The idea is that the learning algorithm will learn what is best to do when the feature has a value significantly different from regular values. Alternatively, you can replace the missing value with a value in the middle of the range. For example, if the range for a feature is [−1, 1], you can set the missing value to be equal to 0. Here, the idea is that the value in the middle of the range will not significantly affect the prediction.
A more advanced technique is to use the missing value as the target variable for a regression problem. You can use all remaining features [x_i^(1), ..., x_i^(j−1), x_i^(j+1), ..., x_i^(D)] to form a feature vector x̂_i, and set ŷ_i = x_i^(j), where j is the feature with a missing value. Then you build a regression model to predict ŷ from x̂. Of course, to build training examples (x̂, ŷ), you only use those examples from the original dataset in which the value of feature j is present.
Finally, if you have a significantly large dataset and just a few features with missing values, you can increase the dimensionality of your feature vectors by adding a binary indicator feature for each feature with missing values. Let’s say feature j in your D-dimensional dataset has missing values. For each feature vector x, you then add feature j = D + 1, which is equal to 1 if the value of feature j is present in x and 0 otherwise. The missing feature value then can be replaced by 0 or any number of your choice.
At prediction time, if your example is not complete, you should use the same data imputation technique to fill the missing features as the technique you used to complete the training data.
Before you start working on the learning problem, you cannot tell which data imputation technique will work the best. Try several techniques, build several models and select the one that works the best.
Choosing a machine learning algorithm can be a difficult task. If you have plenty of time, you can try all of them. However, usually the time you have to solve a problem is limited. You can ask yourself several questions before starting to work on the problem. Depending on your answers, you can shortlist some algorithms and try them on your data.
Does your model have to be explainable to a non-technical audience? Most very accurate learning algorithms are so-called “black boxes.” They learn models that make very few errors, but why a model made a specific prediction could be very hard to understand and even harder to explain. Examples of such models are neural networks or ensemble models.
On the other hand, kNN, linear regression, or decision tree learning algorithms produce models that are not always the most accurate; however, the way they make their predictions is very straightforward.
Can your dataset be fully loaded into the RAM of your server or personal computer? If yes, then you can choose from a wide variety of algorithms. Otherwise, you would prefer incremental learning algorithms that can improve the model by adding more data gradually.
How many training examples do you have in your dataset? How many features does each example have? Some algorithms, including neural networks and gradient boosting (we consider both later), can handle a huge number of examples and millions of features. Others, like SVM, can be very modest in their capacity.
Is your data composed of categorical features only, numerical features only, or a mix of both? Depending on your answer, some algorithms cannot handle your dataset directly, and you would need to convert your categorical features into numerical ones.
Is your data linearly separable or can it be modeled using a linear model? If yes, SVM with the linear kernel, logistic or linear regression can be good choices. Otherwise, deep neural networks or ensemble algorithms, discussed in Chapters 6 and 7, might work better.
How much time is a learning algorithm allowed to use to build a model? Neural networks are known to be slow to train. Simple algorithms like logistic and linear regression or decision trees are much faster. Specialized libraries contain very efficient implementations of some algorithms; you may prefer to do research online to find such libraries. Some algorithms, such as random forests, benefit from the availability of multiple CPU cores, so their model building time can be significantly reduced on a machine with dozens of cores.
How fast does the model have to be when generating predictions? Will your model be used in production where very high throughput is required? Algorithms like SVMs, linear and logistic regression, and (some types of) neural networks, are extremely fast at the prediction time. Others, like kNN, ensemble algorithms, and very deep or recurrent neural networks, are slower2.
If you don’t want to guess the best algorithm for your data, a popular way to choose one is by testing it on the validation set. We talk about that below. Alternatively, if you use scikit-learn, you could try their algorithm selection diagram shown in fig. 17.
Until now, I used the expressions “dataset” and “training set” interchangeably. However, in practice data analysts work with three distinct sets of labeled examples:
Once you have got your annotated dataset, the first thing you do is you shuffle the examples and split the dataset into three subsets: training, validation, and test. The training set is usually the biggest one; you use it to build the model. The validation and test sets are roughly the same sizes, much smaller than the size of the training set. The learning algorithm cannot use examples from these two subsets to build the model. That is why those two sets are often called holdout sets.
There’s no optimal proportion to split the dataset into these three subsets. In the past, the rule of thumb was to use 70% of the dataset for training, 15% for validation and 15% for testing. However, in the age of big data, datasets often have millions of examples. In such cases, it could be reasonable to keep 95% for training and 2.5%/2.5% for validation/testing.
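The shuffle-and-split procedure described above can be sketched in a few lines; the 70/15/15 proportions follow the rule of thumb from the text, and the fixed seed is an assumption added for reproducibility:

```python
import random

# Shuffle a dataset and split it into training, validation, and test sets.
def split_dataset(examples, seed=42):
    examples = examples[:]              # copy, leave the caller's list intact
    random.Random(seed).shuffle(examples)
    n = len(examples)
    n_train = int(0.70 * n)
    n_valid = int(0.15 * n)
    train = examples[:n_train]
    valid = examples[n_train:n_train + n_valid]
    test = examples[n_train + n_valid:]
    return train, valid, test

data = list(range(100))                 # stand-in for 100 labeled examples
train, valid, test = split_dataset(data)
print(len(train), len(valid), len(test))  # 70 15 15
```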
You may wonder what the reason is to have three sets and not one. The answer is simple: when we build a model, what we do not want is for the model to only do well at predicting labels of examples the learning algorithm has already seen. A trivial algorithm that simply memorizes all training examples and then uses that memory to “predict” their labels will make no mistakes when asked to predict the labels of the training examples, but such an algorithm would be useless in practice. What we really want is a model that is good at predicting examples that the learning algorithm didn’t see: we want good performance on a holdout set.
Why do we need two holdout sets and not one? We use the validation set to 1) choose the learning algorithm and 2) find the best values of hyperparameters. We use the test set to assess the model before delivering it to the client or putting it in production.
I mentioned above the notion of bias. I said that a model has a low bias if it predicts well the labels of the training data. If the model makes many mistakes on the training data, we say that the model has a high bias or that the model underfits. So, underfitting is the inability of the model to predict well the labels of the data it was trained on. There could be several reasons for underfitting, the most important of which are:

- your model is too simple for the data (for example, a linear model can often underfit);
- the features you have are not informative enough.
The first reason is easy to illustrate in the case of one-dimensional regression: the dataset can resemble a curved line, but our model is a straight line. The second reason can be illustrated like this: let’s say you want to predict whether a patient has cancer, and the features you have are height, blood pressure, and heart rate. These three features are clearly not good predictors for cancer so our model will not be able to learn a meaningful relationship between these features and the label.
The solution to the problem of underfitting is to try a more complex model or to engineer features with higher predictive power.
Overfitting is another problem a model can exhibit. The model that overfits predicts very well the training data but poorly the data from at least one of the two holdout sets. I already gave an illustration of overfitting in Chapter 3. Several reasons can lead to overfitting, the most important of which are:

- your model is too complex for the data (for example, a very tall decision tree or a very deep or wide neural network often overfits);
- you have too many features but a small number of training examples.
In the literature, you can find another name for the problem of overfitting: the problem of high variance. This term comes from statistics. The variance is an error of the model due to its sensitivity to small fluctuations in the training set. It means that if your training data were sampled differently, learning would result in a significantly different model. That is why a model that overfits performs poorly on the test data: the test and training data are sampled from the dataset independently of one another.
Even the simplest model, such as a linear one, can overfit the data. That usually happens when the data is high-dimensional but the number of training examples is relatively low. In fact, when feature vectors are very high-dimensional, the linear learning algorithm can build a model that assigns non-zero values to most parameters w^(j) in the parameter vector w, trying to find very complex relationships between all available features to predict the labels of training examples perfectly.
Such a complex model will most likely predict poorly the labels of the holdout examples. This is because by trying to perfectly predict labels of all training examples, the model will also learn the idiosyncrasies of the training set: the noise in the values of features of the training examples, the sampling imperfection due to the small dataset size, and other artifacts extrinsic to the decision problem at hand but present in the training set.
Plots in figs. 18-20 illustrate a one-dimensional dataset for which a regression model underfits, fits well, and overfits the data.
Several solutions to the problem of overfitting are possible:

- try a simpler model (linear instead of polynomial regression, or an SVM with a linear kernel instead of RBF, or a neural network with fewer layers or units);
- reduce the dimensionality of the examples in the dataset;
- add more training data, if possible;
- regularize the model.
Regularization is the most widely used approach to prevent overfitting.
Regularization is an umbrella term that encompasses methods that force the learning algorithm to build a less complex model. In practice, that often leads to slightly higher bias but significantly reduces the variance. This problem is known in the literature as the bias-variance tradeoff.
The two most widely used types of regularization are called L1 and L2 regularization. The idea is quite simple. To create a regularized model, we modify the objective function by adding a penalizing term whose value is higher when the model is more complex.
For simplicity, I illustrate regularization using the example of linear regression. The same principle can be applied to a wide variety of models.
Recall the linear regression objective:

min_{w,b} [ (1/N) · Σ_{i=1}^{N} (f_{w,b}(x_i) − y_i)^2 ].
An L1-regularized objective looks like this:

min_{w,b} [ C·|w| + (1/N) · Σ_{i=1}^{N} (f_{w,b}(x_i) − y_i)^2 ],
where |w| = Σ_{j=1}^{D} |w^(j)| and C is a hyperparameter that controls the importance of regularization. If we set C to zero, the model becomes a standard non-regularized linear regression model. On the other hand, if we set C to a high value, the learning algorithm will try to set most w^(j) to a very small value or zero to minimize the objective, and the model will become very simple, which can lead to underfitting. Your role as the data analyst is to find such a value of the hyperparameter C that doesn’t increase the bias too much but reduces the variance to a level reasonable for the problem at hand. In the next section, I will show how to do that.
An L2-regularized objective looks like this:

min_{w,b} [ C·‖w‖^2 + (1/N) · Σ_{i=1}^{N} (f_{w,b}(x_i) − y_i)^2 ],

where ‖w‖^2 = Σ_{j=1}^{D} (w^(j))^2.
In practice, L1 regularization produces a sparse model, a model that has most of its parameters (in the case of linear models, most of the w^(j)) equal to zero, provided the hyperparameter C is large enough. So L1 performs feature selection by deciding which features are essential for prediction and which are not. That can be useful in case you want to increase model explainability. However, if your only goal is to maximize the performance of the model on the holdout data, then L2 usually gives better results. L2 also has the advantage of being differentiable, so gradient descent can be used for optimizing the objective function.
L1 and L2 regularization methods were also combined in what is called elastic net regularization with L1 and L2 regularizations being special cases. You can find in the literature the name ridge regularization for L2 and lasso for L1.
In addition to being widely used with linear models, L1 and L2 regularization are also frequently used with neural networks and many other types of models, which directly minimize an objective function.
Neural networks also benefit from two other regularization techniques: dropout and batch-normalization. There are also non-mathematical methods that have a regularization effect: data augmentation and early stopping. We talk about these techniques in Chapter 8.
Once you have a model which your learning algorithm has built using the training set, how can you say how good the model is? You use the test set to assess the model.
The test set contains the examples that the learning algorithm has never seen before, so if our model performs well on predicting the labels of the examples from the test set, we say that our model generalizes well or, simply, that it’s good.
To be more rigorous, machine learning specialists use various formal metrics and tools to assess the model performance. For regression, the assessment of the model is quite simple. A well-fitting regression model results in predicted values close to the observed data values. The mean model, which always predicts the average of the labels in the training data, generally would be used if there were no informative features. The fit of a regression model being assessed should, therefore, be better than the fit of the mean model. If this is the case, then the next step is to compare the performances of the model on the training and the test data.
To do that, we compute the mean squared error3 (MSE) for the training, and, separately, for the test data. If the MSE of the model on the test data is substantially higher than the MSE obtained on the training data, this is a sign of overfitting. Regularization or a better hyperparameter tuning could solve the problem. The meaning of “substantially higher” depends on the problem at hand and has to be decided by the data analyst jointly with the decision maker/product owner who ordered the model.
For classification, things are a little bit more complicated. The most widely used metrics and tools to assess the classification model are:

- confusion matrix,
- accuracy,
- cost-sensitive accuracy,
- precision/recall, and
- area under the ROC curve (AUC).
To simplify the illustration, I use a binary classification problem. Where necessary, I show how to extend the approach to the multiclass case.
The confusion matrix is a table that summarizes how successful the classification model is at predicting examples belonging to various classes. One axis of the confusion matrix is the label that the model predicted, and the other axis is the actual label. In a binary classification problem, there are two classes. Let’s say, the model predicts two classes: “spam” and “not_spam”:
| | spam (predicted) | not_spam (predicted) |
|---|---|---|
| spam (actual) | 23 (TP) | 1 (FN) |
| not_spam (actual) | 12 (FP) | 556 (TN) |
The above confusion matrix shows that of the 24 examples that actually were spam, the model correctly classified 23 as spam. In this case, we say that we have 23 true positives, or TP = 23. The model incorrectly classified 1 example as not_spam. In this case, we have 1 false negative, or FN = 1. Similarly, of the 568 examples that actually were not spam, 556 were correctly classified (556 true negatives, or TN = 556), and 12 were incorrectly classified (12 false positives, FP = 12).
The confusion matrix for multiclass classification has as many rows and columns as there are different classes. It can help you to determine mistake patterns. For example, a confusion matrix could reveal that a model trained to recognize different species of animals tends to mistakenly predict “cat” instead of “panther,” or “mouse” instead of “rat.” In this case, you can decide to add more labeled examples of these species to help the learning algorithm to “see” the difference between them. Alternatively, you might add additional features the learning algorithm can use to build a model that would better distinguish between these species.
The confusion matrix is used to calculate two other performance metrics: precision and recall.
The two most frequently used metrics to assess the model are precision and recall. Precision is the ratio of correct positive predictions to the overall number of positive predictions:

precision = TP / (TP + FP).

Recall is the ratio of correct positive predictions to the overall number of positive examples in the dataset:

recall = TP / (TP + FN).
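Both metrics follow directly from the counts. A sketch, using the counts from the spam example (TP = 23, FP = 12, FN = 1):

```python
# Precision and recall computed directly from confusion-matrix counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

tp, fp, fn = 23, 12, 1  # counts from the spam confusion matrix
print(round(precision(tp, fp), 3))  # 0.657
print(round(recall(tp, fn), 3))     # 0.958
```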
To understand the meaning and importance of precision and recall for model assessment, it is often useful to think of the prediction problem as the problem of retrieving documents from a database using a query. The precision is the proportion of relevant documents in the list of all returned documents. The recall is the ratio of the relevant documents returned by the search engine to the total number of relevant documents that could have been returned.
In the case of the spam detection problem, we want to have high precision (we want to avoid making mistakes by detecting that a legitimate message is spam) and we are ready to tolerate lower recall (we tolerate some spam messages in our inbox).
Almost always, in practice, we have to choose between high precision and high recall. It’s usually impossible to have both. We can achieve either of the two by various means:
Even though precision and recall are defined for the binary classification case, you can always use them to assess a multiclass classification model. To do that, first select a class for which you want to assess these metrics. Then you consider all examples of the selected class as positives and all examples of the remaining classes as negatives.
Accuracy is given by the number of correctly classified examples divided by the total number of classified examples. In terms of the confusion matrix, it is given by:

accuracy = (TP + TN) / (TP + TN + FP + FN).
Accuracy is a useful metric when errors in predicting all classes are equally important. In case of the spam/not spam, this may not be the case. For example, you would tolerate false positives less than false negatives. A false positive in spam detection is the situation in which your friend sends you an email, but the model labels it as spam and doesn’t show you. On the other hand, the false negative is less of a problem: if your model doesn’t detect a small percentage of spam messages, it’s not a big deal.
For dealing with the situation in which different classes have different importance, a useful metric is cost-sensitive accuracy. To compute a cost-sensitive accuracy, you first assign a cost (a positive number) to both types of mistakes: FP and FN. You then compute the counts TP, TN, FP, FN as usual and multiply the counts for FP and FN by the corresponding cost before calculating the accuracy using eq. 19.
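A sketch of the idea: plain accuracy treats every error equally, while the cost-sensitive variant scales the FP and FN counts by their costs before the division. The cost values below are illustrative, not from the text; here a false positive (a legitimate email flagged as spam) is treated as five times as costly as a false negative:

```python
# Plain accuracy vs. cost-sensitive accuracy on the spam counts.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def cost_sensitive_accuracy(tp, tn, fp, fn, cost_fp, cost_fn):
    # Multiply the error counts by their costs before computing accuracy.
    return (tp + tn) / (tp + tn + cost_fp * fp + cost_fn * fn)

tp, tn, fp, fn = 23, 556, 12, 1
print(round(accuracy(tp, tn, fp, fn), 3))                          # 0.978
print(round(cost_sensitive_accuracy(tp, tn, fp, fn, 5, 1), 3))     # 0.905
```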
The ROC curve (stands for “receiver operating characteristic;” the term comes from radar engineering) is a commonly used method to assess the performance of classification models. ROC curves use a combination of the true positive rate (defined exactly as recall) and false positive rate (the proportion of negative examples predicted incorrectly) to build up a summary picture of the classification performance.
The true positive rate (TPR) and the false positive rate (FPR) are respectively defined as,

TPR = TP / (TP + FN)

and

FPR = FP / (FP + TN).
ROC curves can only be used to assess classifiers that return some confidence score (or a probability) of prediction. For example, logistic regression, neural networks, and decision trees (and ensemble models based on decision trees) can be assessed using ROC curves.
To draw a ROC curve, you first discretize the range of the confidence score. If this range for a model is [0, 1], then you can discretize it like this: [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]. Then, you use each discrete value as the prediction threshold and predict the labels of examples in your dataset using the model and this threshold. For example, if you want to compute TPR and FPR for the threshold equal to 0.7, you apply the model to each example, get the score, and, if the score is higher than or equal to 0.7, you predict the positive class; otherwise, you predict the negative class.
Look at the illustration in fig. 21. It’s easy to see that if the threshold is 0, all our predictions will be positive, so both TPR and FPR will be 1 (the upper right corner). On the other hand, if the threshold is 1, then no positive prediction will be made, and both TPR and FPR will be 0, which corresponds to the lower left corner.
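The threshold sweep can be sketched as follows; the scores and labels are invented for illustration, with 1 denoting the positive class:

```python
# Build ROC points by sweeping a threshold over a set of scores.
def roc_points(scores, labels, thresholds):
    points = []
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
        fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
        fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
        tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
        points.append((fp / (fp + tn), tp / (tp + fn)))  # (FPR, TPR)
    return points

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
pts = roc_points(scores, labels, [0.0, 0.5, 1.1])
# Threshold 0 gives (FPR, TPR) = (1.0, 1.0); a threshold above every
# score gives (0.0, 0.0), matching the corners described in the text.
print(pts)
```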
The higher the area under the ROC curve (AUC), the better the classifier. A classifier with an AUC higher than 0.5 is better than a random classifier. If the AUC is lower than 0.5, then something is wrong with your model. A perfect classifier would have an AUC of 1. Usually, if your model behaves well, you obtain a good classifier by selecting the value of the threshold that gives a TPR close to 1 while keeping the FPR near 0.
ROC curves are popular because they are relatively simple to understand, they capture more than one aspect of the classification (by taking both false positives and false negatives into account), and they make it easy to visually compare the performance of different models.
When I presented learning algorithms, I mentioned that you, as a data analyst, have to select good values for the algorithm’s hyperparameters, such as ε and d for ID3, C for SVM, or α for gradient descent. But what does that exactly mean? Which values are the best, and how do you find them? In this section, I answer these essential questions.
As you already know, hyperparameters aren’t optimized by the learning algorithm itself. The data analyst has to “tune” hyperparameters by experimentally finding the best combination of values, one per hyperparameter.
One typical way to do that, when you have enough data to have a decent validation set (in which each class is represented by at least a couple of dozen examples) and the number of hyperparameters and their ranges are not too large, is to use grid search.
Grid search is the simplest hyperparameter tuning technique. Let’s say you train an SVM and you have two hyperparameters to tune: the penalty parameter C (a positive real number) and the kernel (either “linear” or “rbf”).
If it’s the first time you are working with this particular dataset, you don’t know the possible range of values for C. The most common trick is to use a logarithmic scale. For example, for C you can try the following values: [0.001, 0.01, 0.1, 1, 10, 100, 1000]. In this case, you have 14 combinations of hyperparameters to try: [(0.001, “linear”), (0.01, “linear”), (0.1, “linear”), (1, “linear”), (10, “linear”), (100, “linear”), (1000, “linear”), (0.001, “rbf”), (0.01, “rbf”), (0.1, “rbf”), (1, “rbf”), (10, “rbf”), (100, “rbf”), (1000, “rbf”)].
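The procedure is a pair of nested loops over the candidate values. In the sketch below, `train` and `validate` are placeholders for your actual training and validation code; the toy stand-ins are invented so the example runs:

```python
# A minimal grid search over the two hyperparameters from the text.
import itertools

def grid_search(train, validate, Cs, kernels):
    best_score, best_params = float("-inf"), None
    for C, kernel in itertools.product(Cs, kernels):
        model = train(C=C, kernel=kernel)
        score = validate(model)
        if score > best_score:
            best_score, best_params = score, (C, kernel)
    return best_params, best_score

# Toy stand-ins: pretend the validation score peaks at C = 1 with "rbf".
train = lambda C, kernel: (C, kernel)
validate = lambda m: -abs(m[0] - 1) + (0.5 if m[1] == "rbf" else 0.0)
params, score = grid_search(train, validate,
                            [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                            ["linear", "rbf"])
print(params)  # (1, 'rbf')
```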
You use the training set and train 14 models, one for each combination of hyperparameters. Then you assess the performance of each model on the validation data using one of the metrics we discussed in the previous section (or some other metric that matters to you). Finally, you keep the model that performs the best according to the metric.
Once the best pair of hyperparameters is found, you can try to explore the values close to the best ones in some region around them. Sometimes, this can result in an even better model.
Finally, you assess the selected model using the test set.
As you can see, trying all combinations of hyperparameters, especially if there are more than a couple of them, could be time-consuming, especially for large datasets. There are more efficient techniques, such as random search and Bayesian hyperparameter optimization.
Random search differs from grid search in that you no longer provide a discrete set of values to explore for each hyperparameter; instead, you provide a statistical distribution for each hyperparameter from which values are randomly sampled and set the total number of combinations you want to try.
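A sketch of the idea, assuming a log-uniform distribution for C and a uniform choice of kernel (both distributions are illustrative assumptions, not prescriptions from the text):

```python
# Random search: instead of walking a fixed grid, sample each
# hyperparameter from a distribution for a fixed number of trials.
import math
import random

def random_search(evaluate, n_trials, seed=0):
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        C = 10 ** rng.uniform(-3, 3)            # log-uniform sample for C
        kernel = rng.choice(["linear", "rbf"])  # uniform choice of kernel
        score = evaluate(C, kernel)
        if score > best_score:
            best_score, best_params = score, (C, kernel)
    return best_params

# Toy objective that peaks near C = 1 with the "rbf" kernel.
objective = lambda C, kernel: -abs(math.log10(C)) + (0.5 if kernel == "rbf" else 0.0)
print(random_search(objective, n_trials=50))
```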
Bayesian techniques differ from random or grid search in that they use past evaluation results to choose the next values to evaluate. The idea is to limit the number of expensive optimizations of the objective function by choosing the next hyperparameter values based on those that have done well in the past.
There are also gradient-based techniques, evolutionary optimization techniques, and other algorithmic hyperparameter tuning techniques. Most modern machine learning libraries implement one or more such techniques. There are also hyperparameter tuning libraries that can help you to tune hyperparameters of virtually any learning algorithm, including ones you programmed yourself.
When you don’t have a decent validation set to tune your hyperparameters on, the common technique that can help you is called cross-validation. When you have few training examples, it could be prohibitive to have both a validation and a test set. You would prefer to use more data to train the model. In such a case, you only split your data into a training and a test set. Then you use cross-validation on the training set to simulate a validation set.
Cross-validation works as follows. First, you fix the values of the hyperparameters you want to evaluate. Then you split your training set into several subsets of the same size. Each subset is called a fold. Typically, five-fold cross-validation is used in practice. With five-fold cross-validation, you randomly split your training data into five folds: {F1, F2, F3, F4, F5}. Each Fk, k = 1, …, 5, contains 20% of your training data. Then you train five models as follows. To train the first model, f1, you use all examples from folds F2, F3, F4, and F5 as the training set and the examples from F1 as the validation set. To train the second model, f2, you use the examples from folds F1, F3, F4, and F5 to train and the examples from F2 as the validation set. You continue building models iteratively like this and compute the value of the metric of interest on each validation set, from F1 to F5. Then you average the five values of the metric to get the final value.
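The fold bookkeeping can be sketched like this; `train_and_score` stands in for training a model on one index set and scoring it on the held-out one:

```python
# Five-fold cross-validation sketch: split the training indices into
# folds, train on four folds, evaluate on the held-out fold, average.
def cross_validate(train_and_score, n_examples, k=5):
    indices = list(range(n_examples))
    fold_size = n_examples // k
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f in folds if f is not folds[i] for j in f]
        scores.append(train_and_score(train_idx, val_idx))
    return sum(scores) / k

# Toy stand-in: the "score" is just the fraction of data used for training.
score = cross_validate(lambda tr, va: len(tr) / (len(tr) + len(va)), 100)
print(round(score, 3))  # 0.8
```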
You can use grid search with cross-validation to find the best values of hyperparameters for your model. Once you have found these values, you use the entire training set to build the model with these best values of hyperparameters you have found via cross-validation. Finally, you assess the model using the test set.
When the ordering of values of some categorical variable matters, we can replace those values by numbers, keeping only one variable. For example, if our variable represents the quality of an article, and the values are {poor, decent, good, excellent}, then we could replace those categories by numbers, for example, {1, 2, 3, 4}.
The prediction speed of kNN and of the ensemble methods implemented in modern libraries is still pretty fast. Don’t be afraid of using these algorithms in your practice.
Or any other type of average loss function that makes sense.
First of all, you already know what a neural network is, and you already know how to build such a model. Yes, it’s logistic regression! As a matter of fact, the logistic regression model, or rather its generalization for multiclass classification, called the softmax regression model, is a standard unit in a neural network.
If you understood linear regression, logistic regression, and gradient descent, understanding neural networks should not be a problem.
A neural network (NN), just like a regression or an SVM model, is a mathematical function:
The function f has a particular form: it’s a nested function. You have probably already heard of neural network layers. So, for a 3-layer neural network that returns a scalar, f looks like this:

y = f(x) = f3(f2(f1(x))).
In the above equation, f1 and f2 are vector functions of the following form:

f_l(z) ≜ g_l(W_l z + b_l), (20)
where l is called the layer index and can span from 1 to any number of layers. The function g_l is called an activation function. It is a fixed, usually nonlinear function chosen by the data analyst before the learning is started. The parameters W_l (a matrix) and b_l (a vector) for each layer are learned using the familiar gradient descent by optimizing, depending on the task, a particular cost function (such as MSE). Compare eq. 20 with the equation for logistic regression, where you replace g_l by the sigmoid function, and you will not see any difference. The function f3 is a scalar function for the regression task, but can also be a vector function depending on your problem.
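A minimal forward pass of such a nested function, with arbitrary toy weight values and ReLU as the activation of the first two layers (all numbers below are illustrative only):

```python
# Forward pass of y = f3(f2(f1(x))) with f_l(z) = g_l(W_l z + b_l).
def relu(v):
    return [max(0.0, a) for a in v]

def layer(W, b, z, g):
    # Compute W z + b row by row, then apply the activation g.
    return g([sum(w_i * z_i for w_i, z_i in zip(row, z)) + b_u
              for row, b_u in zip(W, b)])

def forward(x):
    h1 = layer([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], x, relu)   # f1
    h2 = layer([[1.0, 1.0], [-1.0, 1.0]], [0.1, 0.1], h1, relu)  # f2
    # Output layer f3: a single unit with linear activation,
    # so this toy network is a regression model.
    return layer([[1.0, 1.0]], [0.0], h2, lambda v: v)[0]

print(round(forward([2.0, 1.0]), 3))  # 3.2
```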
You may wonder why a matrix W_l is used and not a vector w_l. The reason is that g_l(W_l z + b_l) is a vector function. Each row w_{l,u} (u for unit) of the matrix W_l is a vector of the same dimensionality as z. Let a_{l,u} = w_{l,u} z + b_{l,u}. The output of f_l(z) is the vector [g_l(a_{l,1}), g_l(a_{l,2}), …, g_l(a_{l,size_l})], where g_l is some scalar function1, and size_l is the number of units in layer l. To make it more concrete, let’s consider one architecture of neural networks called multilayer perceptron and often referred to as a vanilla neural network.
We have a closer look at one particular configuration of neural networks called feed-forward neural networks (FFNN), and more specifically the architecture called a multilayer perceptron (MLP). As an illustration, let’s consider an MLP with three layers. Our network takes a two-dimensional feature vector as input and outputs a number. This FFNN can be a regression or a classification model, depending on the activation function used in the third, output layer.
Our MLP is depicted below.
The neural network is represented graphically as a connected combination of units logically organized into one or more layers. Each unit is represented by either a circle or a rectangle. The inbound arrow represents an input of a unit and indicates where this input came from. The outbound arrow indicates the output of a unit.
The output of each unit is the result of the mathematical operation written inside the rectangle. Circle units don’t do anything with the input; they just send their input directly to the output.
The following happens in each rectangle unit. First, all inputs of the unit are joined together to form an input vector. Then the unit applies a linear transformation to the input vector, exactly like the linear regression model does with its input feature vector. Finally, the unit applies an activation function to the result of the linear transformation and obtains the output value, a real number. In a vanilla FFNN, the output value of a unit of some layer becomes an input value of each of the units of the subsequent layer.
In fig. 22, the activation function g_l has one index: l, the index of the layer the unit belongs to. Usually, all units of a layer use the same activation function, but it’s not a rule. Each layer can have a different number of units. Each unit has its own parameters w_{l,u} and b_{l,u}, where u is the index of the unit, and l is the index of the layer. The input vector of each unit consists of the outputs of the units of the previous layer; the input vector of the units in the first layer is the feature vector x.
As you can see in fig. 22, in multilayer perceptron all outputs of one layer are connected to each input of the succeeding layer. This architecture is called fully-connected. A neural network can contain fully-connected layers. Those are the layers whose units receive as inputs the outputs of each of the units of the previous layer.
If we want to solve a regression or a classification problem discussed in previous chapters, the last (the rightmost) layer of a neural network usually contains only one unit. If the activation function g_3 of the last unit is linear, then the neural network is a regression model. If g_3 is a logistic function, the neural network is a binary classification model.
The data analyst can choose any mathematical function as g_{l,u}, assuming it’s differentiable2. The latter property is essential for the gradient descent used to find the values of the parameters w_{l,u} and b_{l,u} for all l and u. The primary purpose of having nonlinear components in the function f is to allow the neural network to approximate nonlinear functions. Without nonlinearities, f would be linear, no matter how many layers it has. The reason is that W_l z + b_l is a linear function, and a linear function of a linear function is also linear.
Popular choices of the activation function are the logistic function, already known to you, as well as TanH and ReLU. The former is the hyperbolic tangent function, similar to the logistic function but ranging from −1 to 1 (without reaching them). The latter is the rectified linear unit function, which equals zero when its input z is negative and to z otherwise:

tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z)),

relu(z) = 0 if z < 0; z otherwise.
As I said above, in the expression W_l z + b_l, W_l is a matrix, while b_l is a vector. That looks different from linear regression’s wz + b. In the matrix W_l, each row u corresponds to a vector of parameters w_{l,u}. The dimensionality of the vector w_{l,u} equals the number of units in layer l − 1. The operation W_l z results in a vector a_l = [w_{l,1} z, w_{l,2} z, …, w_{l,size_l} z]. Then the sum a_l + b_l gives a size_l-dimensional vector c_l. Finally, the function g_l(c_l) produces the vector y_l as output.
Deep learning refers to training neural networks with more than two non-output layers. In the past, it became more difficult to train such networks as the number of layers grew. The two biggest challenges were referred to as the problems of exploding gradient and vanishing gradient as gradient descent was used to train the network parameters.
While the problem of exploding gradient was easier to deal with by applying simple techniques like gradient clipping and L1 or L2 regularization, the problem of vanishing gradient remained intractable for decades.
What is vanishing gradient and why does it arise? To update the values of the parameters in neural networks the algorithm called backpropagation is typically used. Backpropagation is an efficient algorithm for computing gradients on neural networks using the chain rule. In Chapter 4, we have already seen how the chain rule is used to calculate partial derivatives of a complex function. During gradient descent, the neural network’s parameters receive an update proportional to the partial derivative of the cost function with respect to the current parameter in each iteration of training. The problem is that in some cases, the gradient will be vanishingly small, effectively preventing some parameters from changing their value. In the worst case, this may completely stop the neural network from further training.
Traditional activation functions, such as the hyperbolic tangent function I mentioned above, have gradients in the range (0, 1), and backpropagation computes gradients by the chain rule. That has the effect of multiplying n of these small numbers to compute the gradients of the earlier (leftmost) layers in an n-layer network, meaning that the gradient decreases exponentially with n. That results in the earlier layers training very slowly, if at all.
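A quick numeric illustration of the effect: the derivative of tanh is at most 1 and often much smaller, and multiplying one such factor per layer shrinks the product exponentially with depth (the pre-activation value 2.0 below is an arbitrary example):

```python
# Demonstrate the vanishing-gradient effect of tanh-style activations.
import math

def dtanh(z):
    # Derivative of tanh: 1 - tanh(z)^2, always in (0, 1] .
    return 1.0 - math.tanh(z) ** 2

factor = dtanh(2.0)  # a single layer's small gradient factor
for n_layers in (1, 5, 10):
    # The chain rule multiplies one such factor per layer.
    print(n_layers, factor ** n_layers)
```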
However, modern implementations of neural network learning algorithms allow you to effectively train very deep neural networks (up to hundreds of layers). This is due to several improvements combined, including ReLU, LSTM (and other gated units; we consider them below), techniques such as the skip connections used in residual neural networks, and advanced modifications of the gradient descent algorithm.
Therefore, today, since the problems of vanishing and exploding gradient are mostly solved (or their effect diminished) to a great extent, the term “deep learning” refers to training neural networks using the modern algorithmic and mathematical toolkit independently of how deep the neural network is. In practice, many business problems can be solved with neural networks having 2-3 layers between the input and output layers. The layers that are neither input nor output are often called hidden layers.
You may have noticed that the number of parameters an MLP can have grows very fast as you make your network bigger. More specifically, as you add one layer, you add (size_{l−1} + 1) · size_l parameters (our matrix W_l plus the vector b_l). That means that if you add another 1000-unit layer to an existing neural network, then you add more than 1 million additional parameters to your model. Optimizing such big models is a very computationally intensive problem.
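The arithmetic is easy to check:

```python
# Parameters added by one fully-connected layer: the matrix W_l has
# size_l rows of size_{l-1} weights each, plus the bias vector b_l.
def layer_params(size_prev, size_l):
    return (size_prev + 1) * size_l

# Adding a 1000-unit layer after a 1000-unit layer:
print(layer_params(1000, 1000))  # 1001000, i.e., more than a million
```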
When our training examples are images, the input is very high-dimensional3. If you want to learn to classify images using an MLP, the optimization problem is likely to become intractable.
A convolutional neural network (CNN) is a special kind of FFNN that significantly reduces the number of parameters in a deep neural network with many units without losing too much in the quality of the model. CNNs have found applications in image and text processing where they beat many previously established benchmarks.
Because CNNs were invented with image processing in mind, I explain them using the image classification example.
You may have noticed that, in images, pixels that are close to one another usually represent the same type of information: sky, water, leaves, fur, bricks, and so on. The exception to this rule is the edges: the parts of an image where two different objects “touch” one another.
If we can train the neural network to recognize regions of the same information, as well as the edges, this knowledge would allow the neural network to predict the object represented in the image. For example, if the neural network detected multiple skin regions and edges that look like parts of an oval, with a skin-like tone on the inside and a bluish tone on the outside, then it is likely a face on a sky background. If our goal is to detect people in pictures, the neural network will most likely succeed in predicting a person in this picture.
Having in mind that the most important information in the image is local, we can split the image into square patches using a moving window approach4. We can then train multiple smaller regression models at once, each small regression model receiving a square patch as input. The goal of each small regression model is to learn to detect a specific kind of pattern in the input patch. For example, one small regression model will learn to detect the sky; another one will detect the grass, the third one will detect edges of a building, and so on.
In CNNs, a small regression model looks like the one in fig. 22, but it only has the first layer and doesn’t have the second and third ones. To detect some pattern, a small regression model has to learn the parameters of a matrix F (for “filter”) of size p × p, where p is the size of a patch. Let’s assume, for simplicity, that the input image is black and white, with 1 representing black and 0 representing white pixels. Assume also that our patches are 3 by 3 pixels (p = 3). Some patch could then look like the following matrix P (for “patch”):

P = [[0, 1, 0],
     [1, 1, 1],
     [0, 1, 0]]
The above patch represents a pattern that looks like a cross. The small regression model that will detect such patterns (and only them) would need to learn a 3 by 3 parameter matrix F, where the parameters at the positions corresponding to the 1s in the input patch would be positive numbers, while the parameters at the positions corresponding to the 0s would be close to zero. If we calculate the convolution of matrices P and F, the value we obtain is higher the more similar P is to F. To illustrate the convolution of two matrices, assume that F looks like this:

F = [[0, 2, 3],
     [2, 4, 1],
     [0, 3, 0]]
The convolution operator is only defined for matrices that have the same number of rows and columns. For our matrices P and F, it’s calculated by summing the products of the corresponding elements:

P ⊗ F = 0·0 + 1·2 + 0·3 + 1·2 + 1·4 + 1·1 + 0·0 + 1·3 + 0·0 = 12.
If our input patch had a different pattern, for example, that of the letter L,

P = [[1, 0, 0],
     [1, 0, 0],
     [1, 1, 1]]
then the convolution with F would give a lower result: 5. So, you can see that the more the patch “looks” like the filter, the higher the value of the convolution operation. For convenience, there’s also a bias parameter b associated with each filter, which is added to the result of the convolution before applying the nonlinearity (activation function).
One layer of a CNN consists of multiple convolution filters (each with its own bias parameter), just like one layer in a vanilla FFNN consists of multiple units. Each filter of the first (leftmost) layer slides — or convolves — across the input image, left to right, top to bottom, and convolution is computed at each iteration.
An illustration of the process is given in fig. 24 where 6 steps of one filter convolving across an image are shown.
The filter matrix (one for each filter in each layer) and bias values are trainable parameters that are optimized using gradient descent with backpropagation.
A nonlinearity is applied to the sum of the convolution and the bias term. Typically, the ReLU activation function is used in all hidden layers. The activation function of the output layer depends on the task.
Since we can have several filters in each layer, the output of a convolution layer consists of as many matrices as there are filters, one for each filter.
If the CNN has one convolution layer following another convolution layer, then the subsequent layer treats the output of the preceding layer as a collection of image matrices. Such a collection is called a volume. The size of that collection is called the volume's depth. Each filter of the subsequent layer convolves the whole volume. The convolution of a patch of a volume is simply the sum of the convolutions of the corresponding patches of the individual matrices the volume consists of.
Below, you can see an example of the convolution of a patch of a volume.
The value of the convolution was obtained as the sum of the convolutions of the corresponding patches of the individual matrices in the volume.
In computer vision, CNNs often get volumes as input, since an image is usually represented by three channels: R, G, and B, each channel being a monochrome picture.
Two important properties of convolution are stride and padding. Stride is the step size of the moving window. In fig. 24, the stride is 1, that is, the filter slides to the right and to the bottom by one cell at a time.
In the figure below, you can see a partial example of convolution with a larger stride. As you can see, the bigger the stride, the smaller the output matrix.
Padding allows getting a larger output matrix; it's the width of the square of additional cells with which you surround the image (or volume) before you convolve it with the filter. The cells added by padding usually contain zeroes. In fig. 24, the padding is 0, so no additional cells are added to the image.
In fig. 27, on the other hand, the stride and the padding are both nonzero, so a square of additional cells is added around the image. You can see that the output matrix is bigger when the padding is bigger.
An example of an image with padding is shown below:
Padding is helpful with larger filters because it allows them to better “scan” the boundaries of the image.
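The effect of stride and padding on the output size can be sketched with a naive convolution loop. This is an illustrative implementation, not an optimized one; the image and filter values are arbitrary:

```python
import numpy as np

def convolve2d(image, filt, stride=1, padding=0):
    """Slide `filt` over `image` with the moving-window approach.
    `padding` cells of zeros are added around the image first."""
    img = np.pad(image, padding)  # surround the image with zeros
    fh, fw = filt.shape
    out_h = (img.shape[0] - fh) // stride + 1
    out_w = (img.shape[1] - fw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i * stride:i * stride + fh, j * stride:j * stride + fw]
            out[i, j] = np.sum(patch * filt)  # convolution of one patch
    return out

image = np.arange(16).reshape(4, 4)
filt = np.ones((2, 2))
print(convolve2d(image, filt, stride=1, padding=0).shape)  # (3, 3)
print(convolve2d(image, filt, stride=2, padding=0).shape)  # (2, 2): bigger stride, smaller output
print(convolve2d(image, filt, stride=1, padding=1).shape)  # (5, 5): padding enlarges the output
```

The padded rows and columns of zeros let the filter "scan" the boundary pixels as many times as the interior ones.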
This section would not be complete without presenting pooling, a technique very often used in CNNs. Pooling works in a way very similar to convolution, as a filter applied using a moving window approach. However, instead of applying a trainable filter to an input matrix or a volume, a pooling layer applies a fixed operator, usually either max or average. Similarly to convolution, pooling has hyperparameters: the size of the filter and the stride. An example of max pooling with a filter of size 2 and stride 2 is shown below:
Usually, a pooling layer follows a convolution layer, and it gets the output of convolution as input. When pooling is applied to a volume, each matrix in the volume is processed independently of others. Therefore, the output of the pooling layer applied to a volume is a volume of the same depth as the input.
As you can see, pooling only has hyperparameters and doesn't have parameters to learn. Typically, a filter of size 2 or 3 and a stride of 2 are used in practice. Max pooling is more popular than average pooling and often gives better results.
Typically, pooling contributes to the increased accuracy of the model. It also improves the speed of training by reducing the number of parameters of the neural network. (As you can see in fig. 29, with a filter of size 2 and a stride of 2, the number of values is reduced to 25%.)
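Max pooling can be sketched with the same moving-window loop as convolution, except that the fixed max operator replaces the trainable filter. The input matrix here is made up:

```python
import numpy as np

def max_pool(m, size=2, stride=2):
    """Max pooling: a fixed, non-trainable operator applied with a moving window."""
    out_h = (m.shape[0] - size) // stride + 1
    out_w = (m.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = m[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

m = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 2],
              [2, 0, 1, 3]])
print(max_pool(m))  # [[4. 2.] [2. 5.]]
```

With a 2-by-2 filter and stride 2, a 4-by-4 input becomes 2-by-2: each window of four values is replaced by one, which is the 25% reduction mentioned above.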
Recurrent neural networks (RNNs) are used to label, classify, or generate sequences. A sequence is a matrix, each row of which is a feature vector and the order of rows matters. To label a sequence is to predict a class for each feature vector in a sequence. To classify a sequence is to predict a class for the entire sequence. To generate a sequence is to output another sequence (of a possibly different length) somehow relevant to the input sequence.
RNNs are often used in text processing because sentences and texts are naturally sequences of either words and punctuation marks or characters. For the same reason, recurrent neural networks are also used in speech processing.
A recurrent neural network is not feed-forward: it contains loops. The idea is that each unit $u$ of a recurrent layer $l$ has a real-valued state $h_{l,u}$. The state can be seen as the memory of the unit. In an RNN, each unit in each layer receives two inputs: a vector of states from the previous layer and the vector of states from this same layer from the previous timestep.
To illustrate the idea, let’s consider the first and the second recurrent layers of an RNN. The first (leftmost) layer receives a feature vector as input. The second layer receives the output of the first layer as input.
This situation is schematically depicted in fig. 30 below.
As I said above, each training example is a matrix in which each row is a feature vector. For simplicity, let's illustrate this matrix as a sequence of vectors $\mathbf{x}^1, \dots, \mathbf{x}^t, \dots, \mathbf{x}^T$, where $T$ is the length of the input sequence. If our input example is a text sentence, then the feature vector $\mathbf{x}^t$ for each $t = 1, \dots, T$ represents the word in the sentence at position $t$.
As depicted in fig. 30, in an RNN the feature vectors from an input example are "read" by the neural network sequentially, in the order of the timesteps. The index $t$ denotes a timestep. To update the state $h_{l,u}^t$ at each timestep $t$ in each unit $u$ of each layer $l$, we first calculate a linear combination of the input feature vector with the state vector $\mathbf{h}_l^{t-1}$ of this same layer from the previous timestep. The linear combination of the two vectors is calculated using two parameter vectors $\mathbf{w}_{l,u}$, $\mathbf{u}_{l,u}$ and a parameter $b_{l,u}$. The value of $h_{l,u}^t$ is then obtained by applying the activation function $g_1$ to the result of the linear combination. A typical choice for $g_1$ is tanh. The output $\mathbf{y}_l^t$ is typically a vector calculated for the whole layer $l$ at once. To obtain $\mathbf{y}_l^t$, we use an activation function $g_2$ that takes a vector as input and returns a different vector of the same dimensionality. The function $g_2$ is applied to a linear combination of the state vector values calculated using a parameter matrix $\mathbf{V}_l$ and a parameter vector $\mathbf{c}_l$. In classification, a typical choice for $g_2$ is the softmax function:
$$\boldsymbol{\sigma}(\mathbf{z}) \stackrel{\text{def}}{=} \left[\sigma^{(1)}, \dots, \sigma^{(D)}\right], \quad \text{where} \quad \sigma^{(j)} \stackrel{\text{def}}{=} \frac{\exp\left(z^{(j)}\right)}{\sum_{k=1}^{D} \exp\left(z^{(k)}\right)}.$$

The softmax function is a generalization of the sigmoid function to multidimensional outputs. It has the property that $\sigma^{(j)} > 0$ for all $j$ and $\sum_{j=1}^{D} \sigma^{(j)} = 1$.
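A minimal softmax implementation makes both properties easy to verify. Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import math

def softmax(z):
    """Softmax: turns a vector of D scores into D positive values summing to 1."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # every component is in (0, 1), larger scores get larger probabilities
print(sum(probs))  # 1.0
```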
The dimensionality of $\mathbf{V}_l$ is chosen by the data analyst such that the multiplication of the matrix $\mathbf{V}_l$ by the vector $\mathbf{h}_l^t$ results in a vector of the same dimensionality as that of the vector $\mathbf{y}_l^t$. This choice depends on the dimensionality of the output label $\mathbf{y}$ in your training data. (Until now, we only saw one-dimensional labels, but we will see in future chapters that labels can be multidimensional as well.)
The values of the parameters $\mathbf{w}_{l,u}$, $\mathbf{u}_{l,u}$, $b_{l,u}$, $\mathbf{V}_l$, and $\mathbf{c}_l$ are computed from the training data using gradient descent with backpropagation. To train RNN models, a special version of backpropagation is used, called backpropagation through time.
Both tanh and softmax suffer from the vanishing gradient problem. Even if our RNN has just one or two recurrent layers, because of the sequential nature of the input, backpropagation has to "unfold" the network over time. From the point of view of the gradient calculation, in practice this means that the longer the input sequence, the deeper the unfolded network.
Another problem RNNs have is that of handling long-term dependencies. As the length of the input sequence grows, the feature vectors from the beginning of the sequence tend to be "forgotten," because the state of each unit, which serves as the network's memory, becomes significantly affected by the feature vectors read more recently. Therefore, in text or speech processing, the cause-effect link between distant words in a long sentence can be lost.
The most effective recurrent neural network models used in practice are gated RNNs. These include the long short-term memory (LSTM) networks and networks based on the gated recurrent unit (GRU).
The beauty of using gated units in RNNs is that such networks can store information in their units for future use, much like bits in a computer's memory. The difference with real memory is that reading, writing, and erasure of information stored in each unit is controlled by activation functions that take values in the range $(0, 1)$. The trained neural network can "read" the input sequence of feature vectors and decide at some early timestep to keep specific information about the feature vectors. That information about the earlier feature vectors can later be used by the model to process the feature vectors from near the end of the input sequence. For example, if the input text starts with the word she, a language-processing RNN model could decide to store the information about the gender to correctly interpret the word their seen later in the sentence.
Units make decisions about what information to store, and when to allow reads, writes, and erasures. Those decisions are learned from data and implemented through the concept of gates. There are several architectures of gated units. A simple but effective one is called the minimal gated GRU; it is composed of a memory cell and a forget gate.
Let's look at the math of a GRU unit on the example of the first layer of the RNN (the one that takes the sequence of feature vectors as input). A minimal gated GRU unit $u$ in layer $l$ takes two inputs: the vector of the memory cell values from all units in the same layer from the previous timestep, $\mathbf{h}_l^{t-1}$, and a feature vector $\mathbf{x}^t$. It then uses these two vectors as follows (all operations in the below sequence are executed in the unit one after another):
$$\tilde{h}_{l,u}^t \leftarrow g_1\left(\mathbf{w}_{l,u} \mathbf{x}^t + \mathbf{u}_{l,u} \mathbf{h}_l^{t-1} + b_{l,u}\right),$$
$$\Gamma_{l,u}^t \leftarrow g_2\left(\mathbf{m}_{l,u} \mathbf{x}^t + \mathbf{o}_{l,u} \mathbf{h}_l^{t-1} + a_{l,u}\right),$$
$$h_{l,u}^t \leftarrow \Gamma_{l,u}^t \tilde{h}_{l,u}^t + \left(1 - \Gamma_{l,u}^t\right) h_{l,u}^{t-1},$$
$$\mathbf{y}_l^t \leftarrow g_3\left(\mathbf{V}_l \mathbf{h}_l^t + \mathbf{c}_l\right),$$

where $g_1$ is the tanh activation function and $g_2$ is called the gate function; it is implemented as the sigmoid function taking values in the range $(0, 1)$. If the gate $\Gamma_{l,u}^t$ is close to $0$, then the memory cell keeps its value from the previous timestep, $h_{l,u}^{t-1}$. On the other hand, if the gate is close to $1$, the value of the memory cell is overwritten by the new value $\tilde{h}_{l,u}^t$ (see the third assignment from the top). Just like in standard RNNs, $g_3$ is usually softmax.
A gated unit takes an input and stores it for some time. This is equivalent to applying the identity function to the input. Because the derivative of the identity function is constant, when a network with gated units is trained with backpropagation through time, the gradient does not vanish.
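The gate's keep-or-overwrite behavior can be sketched for a single scalar unit. The parameter names below are illustrative scalars (one weight per input instead of the vectors used in the text), and all numeric values are made up:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def minimal_gated_step(h_prev, x, w, u, b, m, o, a):
    """One timestep of a minimal gated unit for a single scalar unit.
    w, u, b parametrize the candidate memory; m, o, a parametrize the gate."""
    h_tilde = math.tanh(w * x + u * h_prev + b)    # candidate new memory value
    gate = sigmoid(m * x + o * h_prev + a)         # forget gate, in (0, 1)
    return gate * h_tilde + (1.0 - gate) * h_prev  # gate near 0 keeps h_prev

# With the gate bias pushed very negative, the gate is near 0 and the unit
# keeps its previous memory almost unchanged:
kept = minimal_gated_step(h_prev=0.9, x=1.0, w=0.5, u=0.5, b=0.0,
                          m=0.0, o=0.0, a=-10.0)
print(round(kept, 3))  # 0.9
```

Pushing the gate bias far positive instead makes the gate close to 1, so the memory is overwritten by the candidate value, as in the third assignment described above.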
Other important extensions to RNNs include bi-directional RNNs, RNNs with attention, and sequence-to-sequence RNN models. The latter, in particular, are frequently used to build neural machine translation models and other models for text-to-text transformations. A generalization of an RNN is the recursive neural network.
A scalar function outputs a scalar, that is, a simple number and not a vector.↩
The function has to be differentiable across its whole domain or in the majority of the points of its domain. For example, ReLU is not differentiable at 0.↩
Each pixel of an image is a feature. If our image is 100 by 100 pixels, then there are 10,000 features.↩
Consider this as if you looked at a dollar bill in a microscope. To see the whole bill you have to gradually move your bill from left to right and from top to bottom. At each moment in time, you see only a part of the bill of fixed dimensions. This approach is called moving window.↩
To save space, in fig. 27, only the first two of the nine convolutions are shown.↩
We talked about linear regression, but what if our data doesn't have the form of a straight line? Polynomial regression could help. Let's say we have one-dimensional data $\{(x_i, y_i)\}_{i=1}^{N}$. We could try to fit a quadratic line $y = w_1 x^2 + w_2 x + b$ to our data. By defining the mean squared error (MSE) cost function, we could apply gradient descent and find the values of the parameters $w_1$, $w_2$, and $b$ that minimize this cost function. In one- or two-dimensional space, we can easily see whether the function fits the data. However, if our input is a $D$-dimensional feature vector, with $D > 3$, finding the right polynomial would be hard.
Kernel regression is a non-parametric method. That means there are no parameters to learn. The model is based on the data itself (like in kNN). In its simplest form, in kernel regression we look for a model like this:

$$f(x) = \frac{1}{N} \sum_{i=1}^{N} w_i y_i,$$
where

$$w_i = \frac{N \, k\!\left(\frac{x_i - x}{b}\right)}{\sum_{l=1}^{N} k\!\left(\frac{x_l - x}{b}\right)}.$$

The function $k(\cdot)$ is called a kernel. The kernel plays the role of a similarity function: the values of the coefficients $w_i$ are higher when $x$ is similar to $x_i$ and lower when they are dissimilar. Kernels can have different forms; the most frequently used one is the Gaussian kernel:

$$k(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right).$$
The value $b$ is a hyperparameter that we tune using the validation set (by running the model built with a specific value of $b$ on the validation set examples and calculating the MSE). You can see an illustration of the influence $b$ has on the shape of the regression line in fig. 31–fig. 33.
If your inputs are multi-dimensional feature vectors, the terms $x_i - x$ and $x_l - x$ in eq. 21 have to be replaced by the Euclidean distances $\|\mathbf{x}_i - \mathbf{x}\|$ and $\|\mathbf{x}_l - \mathbf{x}\|$, respectively.
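A kernel regression prediction is just a kernel-weighted average of the training labels. Here is a one-dimensional sketch with the Gaussian kernel; the toy data points are made up:

```python
import math

def gaussian_kernel(z):
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def kernel_regression(x_query, xs, ys, b):
    """Kernel-weighted average of the training labels; b is the bandwidth
    hyperparameter that would be tuned on a validation set."""
    ks = [gaussian_kernel((xi - x_query) / b) for xi in xs]
    total = sum(ks)
    return sum(k * y for k, y in zip(ks, ys)) / total

# Hypothetical one-dimensional training data:
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]

print(kernel_regression(1.0, xs, ys, b=0.3))   # small b: close to the nearest label, ~1
print(kernel_regression(1.0, xs, ys, b=1000))  # huge b: close to the mean of ys, 3.5
```

The two calls illustrate the role of $b$: a small bandwidth makes the fit very local (wiggly), a large one smooths it toward the global average.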
Although many classification problems can be defined using two classes, some are defined with more than two classes, which requires adaptations of our machine learning algorithms.
In multiclass classification, the label can be one of $C$ classes: $y \in \{1, \dots, C\}$. Many machine learning algorithms are binary; SVM is an example. Some algorithms can naturally be extended to handle multiclass problems. ID3 and other decision tree learning algorithms can be simply changed like this:
$$\Pr(y = c \mid \mathbf{x}) = \frac{1}{|S|} \sum_{(\mathbf{x}, y) \in S} [y = c],$$

for all $c \in \{1, \dots, C\}$, where $S$ is the leaf node in which the prediction is made.
Logistic regression can be naturally extended to multiclass learning problems by replacing the sigmoid function with the softmax function which we already saw in Chapter 6.
The kNN algorithm is also straightforward to extend to the multiclass case: when we find the $k$ closest examples for the input and examine them, we return the class that we saw most often among those $k$ examples.
SVM cannot be naturally extended to multiclass problems. Other algorithms can be implemented more efficiently in the binary case. What should you do if you have a multiclass problem but a binary classification learning algorithm? One common strategy is called one versus rest. The idea is to transform a multiclass problem into $C$ binary classification problems and build $C$ binary classifiers. For example, if we have three classes, $y \in \{1, 2, 3\}$, we create three copies of the original dataset and modify them. In the first copy, we replace all labels not equal to 1 by 0. In the second copy, we replace all labels not equal to 2 by 0. In the third copy, we replace all labels not equal to 3 by 0. Now we have three binary classification problems where we have to learn to distinguish between labels 1 and 0, 2 and 0, and 3 and 0.
Once we have the three models, to classify a new input feature vector $\mathbf{x}$, we apply the three models to the input and get three predictions. We then pick the prediction of a non-zero class which is the most certain. Remember that in logistic regression, the model returns not a label but a score (between 0 and 1) that can be interpreted as the probability that the label is positive. We can also interpret this score as the certainty of prediction. In SVM, the analog of certainty is the distance $d$ from the input $\mathbf{x}$ to the decision boundary, given by,

$$d \stackrel{\text{def}}{=} \frac{\mathbf{w} \mathbf{x} + b}{\|\mathbf{w}\|}.$$
The larger the distance, the more certain the prediction. Most learning algorithms either can be naturally converted to the multiclass case, or they return a score we can use in the one versus rest strategy.
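The dataset-copying step of one versus rest can be sketched in a few lines. The examples below are hypothetical (feature, label) pairs:

```python
def one_versus_rest_datasets(examples):
    """For each class c, make a copy of the dataset in which labels equal
    to c become 1 and every other label becomes 0."""
    classes = sorted({y for _, y in examples})
    return {c: [(x, 1 if y == c else 0) for x, y in examples]
            for c in classes}

data = [([0.1], 1), ([0.9], 2), ([0.5], 3)]
binary = one_versus_rest_datasets(data)
print(binary[2])  # [([0.1], 0), ([0.9], 1), ([0.5], 0)]
```

Each of the resulting copies would then be used to train one binary classifier, and at prediction time the most certain non-zero prediction among the classifiers wins.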
Sometimes we only have examples of one class and we want to train a model that would distinguish examples of this class from everything else.
One-class classification, also known as unary classification or class modeling, tries to identify objects of a specific class among all objects, by learning from a training set containing only the objects of that class. That is different from and more difficult than the traditional classification problem, which tries to distinguish between two or more classes with the training set containing objects from all classes. A typical one-class classification problem is the classification of the traffic in a secure computer network as normal. In this scenario, there are few, if any, examples of the traffic under an attack or during an intrusion. However, the examples of normal traffic are often in abundance. One-class classification learning algorithms are used for outlier detection, anomaly detection, and novelty detection.
There are several one-class learning algorithms. The most widely used in practice are one-class Gaussian, one-class k-means, one-class kNN, and one-class SVM.
The idea behind the one-class Gaussian is that we model our data as if it came from a Gaussian distribution, more precisely, a multivariate normal distribution (MND). The probability density function (pdf) of the MND is given by the following equation:

$$f_{\boldsymbol{\mu}, \boldsymbol{\Sigma}}(\mathbf{x}) = \frac{\exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)}{\sqrt{(2\pi)^{D} |\boldsymbol{\Sigma}|}},$$
where $f_{\boldsymbol{\mu}, \boldsymbol{\Sigma}}(\mathbf{x})$ returns the probability density corresponding to the input feature vector $\mathbf{x}$. The probability density can be interpreted as the likelihood that example $\mathbf{x}$ was drawn from the probability distribution we model as an MND. The values $\boldsymbol{\mu}$ (a vector) and $\boldsymbol{\Sigma}$ (a matrix) are the parameters we have to learn. The maximum likelihood criterion (similar to how we solved the logistic regression learning problem) is optimized to find the optimal values for these two parameters. $|\boldsymbol{\Sigma}|$ is the determinant of the matrix $\boldsymbol{\Sigma}$; the notation $\boldsymbol{\Sigma}^{-1}$ means the inverse of the matrix $\boldsymbol{\Sigma}$.
If the terms determinant and inverse are new to you, don't worry. These are standard operations on vectors and matrices from the branch of mathematics called matrix theory. If you feel the need to know what they are, Wikipedia explains these concepts well.
In practice, the numbers in the vector $\boldsymbol{\mu}$ determine the place where the curve of our Gaussian distribution is centered, while the numbers in $\boldsymbol{\Sigma}$ determine the shape of the curve. For a training set consisting of two-dimensional feature vectors, an example of the one-class Gaussian model is given in fig. 34–fig. 35.
Once we have our model, parametrized by $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ learned from the data, we predict the likelihood of every input $\mathbf{x}$ by using $f_{\boldsymbol{\mu}, \boldsymbol{\Sigma}}(\mathbf{x})$. Only if the likelihood is above a certain threshold do we predict that the example belongs to our class; otherwise, it is classified as an outlier. The value of the threshold is found experimentally or using an "educated guess."
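The whole one-class Gaussian pipeline fits in a short sketch: estimate $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ from one-class data, evaluate the MND density for a new input, and compare it against a threshold. The training points and the threshold value below are hypothetical:

```python
import numpy as np

def mnd_density(x, mu, sigma):
    """Probability density of the multivariate normal distribution at x,
    using the determinant and the inverse of the covariance matrix."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm)

# Estimate mu and sigma from (made-up) one-class training data:
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
mu = X.mean(axis=0)
sigma = np.cov(X.T)

threshold = 0.05  # hypothetical value, found experimentally in practice

print(mnd_density(np.array([0.5, 0.5]), mu, sigma) > threshold)    # True: in-class
print(mnd_density(np.array([10.0, 10.0]), mu, sigma) > threshold)  # False: outlier
```

A point near the cloud's center gets a high density and is accepted; a distant point gets a density near zero and is flagged as an outlier.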
When the data has a more complex shape, a more advanced algorithm can use a combination of several Gaussians (called a mixture of Gaussians). In this case, there are more parameters to learn from data: one $\boldsymbol{\mu}$ and one $\boldsymbol{\Sigma}$ for each Gaussian, as well as the parameters that allow combining multiple Gaussians to form one pdf. In Chapter 9, we consider a mixture of Gaussians with an application to clustering.
One-class k-means and one-class kNN are based on a principle similar to that of the one-class Gaussian: build some model of the data and then define a threshold to decide whether our new feature vector $\mathbf{x}$ looks similar to the other examples according to the model. In the former, all training examples are clustered using the k-means clustering algorithm and, when a new example $\mathbf{x}$ is observed, the distance $d$ is calculated as the minimum distance between $\mathbf{x}$ and the center of each cluster. If $d$ is less than a particular threshold, then $\mathbf{x}$ belongs to the class.
One-class SVM, depending on formulation, tries either 1) to separate all training examples from the origin (in the feature space) and maximize the distance from the hyperplane to the origin, or 2) to obtain a spherical boundary around the data by minimizing the volume of this hypersphere. I leave the description of the one-class kNN algorithm, as well as the details of the one-class k-means and one-class SVM for the complementary reading.
In some situations, more than one label is appropriate to describe an example from the dataset. In this case, we talk about multi-label classification.
For instance, if we want to describe an image, we could assign several labels to it: “conifer,” “mountain,” “road,” all three at the same time (fig. 36).
If the number of possible values for labels is high, but they are all of the same nature, like tags, we can transform each labeled example into several labeled examples, one per label. These new examples all have the same feature vector and only one label. That becomes a multiclass classification problem. We can solve it using the one versus rest strategy. The only difference with the usual multiclass problem is that now we have a new hyperparameter: threshold. If the prediction score for some label is above the threshold, this label is predicted for the input feature vector. In this scenario, multiple labels can be predicted for one feature vector. The value of the threshold is chosen using the validation set.
Analogously, algorithms that can naturally be made multiclass (decision trees, logistic regression, and neural networks, among others) can be applied to multi-label classification problems. Because they return a score for each class, we can define a threshold and assign to one feature vector all labels whose scores are above that threshold.
Neural network algorithms can naturally train multi-label classification models by using the binary cross-entropy cost function. The output layer of the neural network, in this case, has one unit per label. Each unit of the output layer has the sigmoid activation function. Accordingly, each label $l$ is binary ($y_{i,l} \in \{0, 1\}$), where $l = 1, \dots, L$ and $i = 1, \dots, N$. The binary cross-entropy of predicting the probability $\hat{y}_{i,l}$ that example $\mathbf{x}_i$ has label $l$ is defined as,

$$-\left(y_{i,l} \ln(\hat{y}_{i,l}) + (1 - y_{i,l}) \ln(1 - \hat{y}_{i,l})\right).$$
The minimization criterion is simply the average of all binary cross-entropy terms across all training examples and all labels of those examples.
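The criterion is easy to compute directly. Below is a small sketch with made-up label matrices for two examples and three labels:

```python
import math

def binary_cross_entropy(y, y_hat):
    """Binary cross-entropy for one label of one example:
    y is 0 or 1, y_hat is the predicted probability that the label is 1."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Hypothetical multi-label data: two examples, three possible labels each.
y_true = [[1, 0, 1],
          [0, 0, 1]]
y_pred = [[0.9, 0.2, 0.8],
          [0.1, 0.3, 0.7]]

# The minimization criterion: average over all examples and all labels.
terms = [binary_cross_entropy(y, p)
         for row_t, row_p in zip(y_true, y_pred)
         for y, p in zip(row_t, row_p)]
cost = sum(terms) / len(terms)
print(cost)
```

Note that a confident correct prediction (say, 0.9 for a true label) contributes a much smaller term than an uncertain one (0.5), which is what drives learning toward confident correct outputs.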
In cases where the number of possible values each label can take is small, one can convert a multi-label problem into a multiclass problem using a different approach. Imagine the following problem. We want to label images, and labels can be of two types. The first type of label can have two possible values: {photo, painting}; the label of the second type can have three possible values: {portrait, landscape, other}. We can create a new fake class for each combination of the two original classes, like this:
| Fake class | Real class 1 | Real class 2 |
|---|---|---|
| 1 | photo | portrait |
| 2 | photo | landscape |
| 3 | photo | other |
| 4 | painting | portrait |
| 5 | painting | landscape |
| 6 | painting | other |
Now we have the same labeled examples, but we replace the real multi-labels with one fake label with values from 1 to 6. This approach works well in practice when there are not too many possible combinations of classes. Otherwise, you need to use much more training data to compensate for the increased set of classes.
The primary advantage of this latter approach is that it keeps your labels correlated, contrary to the previously seen methods, which predict each label independently of the others. Correlation between labels can be essential in many problems. For example, if you want to predict whether an email is spam or not_spam at the same time as predicting whether it's an ordinary or priority email, you would like to avoid predictions like [spam, priority].
The fundamental algorithms that we considered in Chapter 3 have their limitations. Because of their simplicity, sometimes they cannot produce a model accurate enough for your problem. You could try using deep neural networks. However, in practice, deep neural networks require a significant amount of labeled data which you might not have. Another approach to boost the performance of simple learning algorithms is ensemble learning.
Ensemble learning is a learning paradigm that, instead of trying to learn one super-accurate model, focuses on training a large number of low-accuracy models and then combining the predictions given by those weak models to obtain a high-accuracy meta-model.
Low-accuracy models are usually learned by weak learners, that is, learning algorithms that cannot learn complex models, and thus are typically fast at the training and at the prediction time. The most frequently used weak learner is a decision tree learning algorithm in which we often stop splitting the training set after just a few iterations. The obtained trees are shallow and not particularly accurate, but the idea behind ensemble learning is that if the trees are not identical and each tree is at least slightly better than random guessing, then we can obtain high accuracy by combining a large number of such trees.
To obtain the prediction for an input $\mathbf{x}$, the predictions of each weak model are combined using some sort of weighted voting. The specific form of vote weighting depends on the algorithm, but, independently of the algorithm, the idea is the same: if the council of weak models predicts that the message is spam, then we assign the label spam to $\mathbf{x}$.
Two principal ensemble learning methods are boosting and bagging.
Boosting consists of using the original training data and iteratively creating multiple models by using a weak learner. Each new model differs from the previous ones in the sense that the weak learner, when building each new model, tries to "fix" the errors made by the previous models. The final ensemble model is a certain combination of those multiple weak models built iteratively.
Bagging consists of creating many "copies" of the training data (each copy slightly different from the others), applying the weak learner to each copy to obtain multiple weak models, and then combining them. A widely used and effective machine learning algorithm based on the idea of bagging is random forest.
The “vanilla” bagging algorithm works as follows. Given a training set, we create B random samples S_b (for each b = 1, …, B) of the training set and build a decision tree model f_b using each sample S_b as the training set. To sample S_b for some b, we do the sampling with replacement. This means that we start with an empty set, then pick at random an example from the training set and put its exact copy into S_b, keeping the original example in the original training set. We keep picking examples at random until |S_b| = N.
After training, we have B decision trees. The prediction for a new example x is obtained as the average of B predictions:

\[ y \leftarrow \hat{f}(x) \stackrel{\text{def}}{=} \frac{1}{B}\sum_{b=1}^{B} f_b(x), \]
in the case of regression, or by taking the majority vote in the case of classification.
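The bagging procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from the book: the weak learner is a hypothetical depth-1 regression tree (a “stump”), and all function names are invented for the example.

```python
import numpy as np

def fit_stump(X, y):
    # Weak learner: a depth-1 regression tree. Pick the (feature, threshold)
    # pair that minimizes the sum of squared errors of the two leaf means.
    best = (np.inf, 0, 0.0, y.mean(), y.mean())
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, t, left.mean(), right.mean())
    _, j, t, lv, rv = best
    return lambda Q: np.where(Q[:, j] <= t, lv, rv)

def bagging_fit(X, y, B=25, seed=0):
    # Create B bootstrap samples (sampling with replacement, |S_b| = N)
    # and train one weak model per sample.
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # bootstrap sample S_b
        models.append(fit_stump(X[idx], y[idx]))
    # Regression prediction: average the B weak predictions.
    return lambda Q: np.mean([m(Q) for m in models], axis=0)
```

For classification, the returned predictor would take the majority vote of the B weak predictions instead of their mean.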
Random forest differs from vanilla bagging in just one way. It uses a modified tree learning algorithm that inspects, at each split in the learning process, a random subset of the features. The reason for doing this is to avoid the correlation of the trees: if one or a few features are very strong predictors of the target, these features will be selected to split examples in many trees. This would result in many correlated trees in our “forest.” Correlated predictors cannot help in improving the accuracy of prediction. The main reason behind the better performance of model ensembling is that models that are good will likely agree on the same prediction, while bad models will likely disagree on different predictions. Correlation makes bad models more likely to agree, which hampers the majority vote or the average.
The most important hyperparameters to tune are the number of trees, B, and the size of the random subset of the features to consider at each split.
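The feature-subsampling tweak itself is tiny; a hedged sketch follows (the helper name and the default subset sizes are common conventions, not requirements from the text):

```python
import numpy as np

def random_feature_subset(n_features, rng, k=None):
    # At each split, a random-forest tree considers only k of the D features.
    # Common defaults: k = sqrt(D) for classification, k = D / 3 for regression.
    if k is None:
        k = max(1, int(np.sqrt(n_features)))
    return rng.choice(n_features, size=k, replace=False)
```

A tree learner would call this once per split and search for the best threshold only among the returned features.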
Random forest is one of the most widely used ensemble learning algorithms. Why is it so effective? The reason is that by using multiple samples of the original dataset, we reduce the variance of the final model. Remember that low variance means low overfitting. Overfitting happens when our model tries to explain small variations in the dataset, because our dataset is just a small sample of the population of all possible examples of the phenomenon we try to model. If we were unlucky with how our training set was sampled, then it could contain some undesirable (but unavoidable) artifacts: noise, outliers, and over- or underrepresented examples. By creating multiple random samples with replacement of our training set, we reduce the effect of these artifacts.
Another effective ensemble learning algorithm, based on the idea of boosting, is gradient boosting. Let’s first look at gradient boosting for regression. To build a strong regressor, we start with a constant model f = f_0 (just like we did in ID3):

\[ f = f_0(x) \stackrel{\text{def}}{=} \frac{1}{N}\sum_{i=1}^{N} y_i. \]
Then we modify the label of each example i = 1, …, N in our training set as follows:

\[ \hat{y}_i \leftarrow y_i - f(x_i), \qquad (22) \]
where ŷ_i, called the residual, is the new label for example x_i.
Now we use the modified training set, with residuals instead of the original labels, to build a new decision tree model, f_1. The boosting model is now defined as f ← f_0 + αf_1, where α is the learning rate (a hyperparameter).
Then we recompute the residuals using eq. 22, replace the labels in the training data once again, train the new decision tree model f_2, redefine the boosting model as f ← f_0 + αf_1 + αf_2, and the process continues until the predefined maximum M of trees are combined.
Intuitively, what’s happening here? By computing the residuals, we find how well (or poorly) the target of each training example is predicted by the current model f. We then train another tree to fix the errors of the current model (this is why we use residuals instead of the real labels) and add this new tree to the existing model with some weight α. Therefore, each additional tree added to the model partially fixes the errors made by the previous trees, until the maximum number M (another hyperparameter) of trees are combined.
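The regression boosting loop above can be sketched as follows. This is a toy illustration with a one-dimensional stump as the weak learner; the names and default values are invented for the example, not prescribed by the text:

```python
import numpy as np

def fit_stump(x, y):
    # Weak learner: a depth-1 regression tree on a single feature.
    best = (np.inf, 0.0, y.mean(), y.mean())
    for t in np.unique(x):
        l, r = y[x <= t], y[x > t]
        if len(l) == 0 or len(r) == 0:
            continue
        sse = ((l - l.mean()) ** 2).sum() + ((r - r.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, l.mean(), r.mean())
    _, t, lv, rv = best
    return lambda q: np.where(q <= t, lv, rv)

def gradient_boost(x, y, M=50, alpha=0.3):
    f0 = y.mean()                  # constant initial model f_0
    f, trees = np.full_like(y, f0), []
    for _ in range(M):
        residuals = y - f          # the residuals become the new labels
        tree = fit_stump(x, residuals)
        trees.append(tree)
        f = f + alpha * tree(x)    # f <- f + alpha * f_m
    return lambda q: f0 + alpha * sum(t(q) for t in trees)
```

Each iteration fits a tree to the residuals of the current ensemble, so the errors shrink geometrically as trees are added.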
Now you should reasonably ask why the algorithm is called gradient boosting. In gradient boosting, we don’t calculate any gradient, contrary to what we did in Chapter 4 for linear regression. To see the similarity between gradient boosting and gradient descent, remember why we calculated the gradient in linear regression: we did that to get an idea of where we should move the values of our parameters so that the MSE cost function reaches its minimum. The gradient showed the direction, but we didn’t know how far we should go in this direction, so we used a small step at each iteration and then reevaluated the direction. The same happens in gradient boosting. However, instead of getting the gradient directly, we use its proxy in the form of residuals: they show us how the model has to be adjusted so that the error (the residual) is reduced.
The three principal hyperparameters to tune in gradient boosting are the number of trees, the learning rate, and the depth of trees — all three affect model accuracy. The depth of trees also affects the speed of training and prediction: the shorter, the faster.
It can be shown that training on residuals optimizes the overall model for the mean squared error criterion. You can see the difference with bagging here: boosting reduces the bias (or underfitting) instead of the variance. As such, boosting can overfit. However, by tuning the depth and the number of trees, overfitting can be largely avoided.
Gradient boosting for classification is similar, but the steps are slightly different. Let’s consider the binary case. Assume we have M regression decision trees. Similarly to logistic regression, the prediction of the ensemble of decision trees is modeled using the sigmoid function:

\[ \Pr(y = 1 \mid x, f) \stackrel{\text{def}}{=} \frac{1}{1 + e^{-f(x)}}, \]
where f(x) = Σ_{m=1}^{M} f_m(x) and each f_m is a regression tree.
Again, like in logistic regression, we apply the maximum likelihood principle by trying to find such an f that maximizes L(f) = Σ_{i=1}^{N} ln[Pr(y_i | x_i, f)]. Again, to avoid numerical overflow, we maximize the sum of log-likelihoods rather than the product of likelihoods.
The algorithm starts with the initial constant model f = f_0 = ln(p / (1 − p)), where p = (1/N) Σ_{i=1}^{N} y_i. (It can be shown that such an initialization is optimal for the sigmoid function.) Then at each iteration m, a new tree f_m is added to the model. To find the best f_m, first the partial derivative g_i of the current model is calculated for each i = 1, …, N:

\[ g_i = \frac{\partial L}{\partial f}, \]
where f is the ensemble classifier model built at the previous iteration m − 1. To calculate g_i we need to find the derivatives of ln[Pr(y_i | x_i, f)] with respect to f for all i. Notice that ln[Pr(y_i = 1 | x_i, f)] = ln(1 / (1 + e^{−f(x_i)})). The derivative of the right-hand term in the previous equation with respect to f equals 1 / (e^{f(x_i)} + 1).
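Spelled out, the derivative used here follows from a standard sigmoid identity (writing f as shorthand for f(x_i)):

```latex
\frac{\partial}{\partial f}\ln\frac{1}{1+e^{-f}}
  = -\frac{\partial}{\partial f}\ln\left(1+e^{-f}\right)
  = \frac{e^{-f}}{1+e^{-f}}
  = \frac{1}{e^{f}+1}.
```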
We then transform our training set by replacing the original label y_i with the corresponding partial derivative g_i, and build a new tree f_m using the transformed training set. Then we find the optimal update step ρ_m as:

\[ \rho_m \leftarrow \arg\max_{\rho} \, L(f + \rho f_m). \]
At the end of iteration m, we update the ensemble model f by adding the new tree f_m:

\[ f \leftarrow f + \alpha \rho_m f_m. \]
We iterate until m = M, then we stop and return the ensemble model f.
Gradient boosting is one of the most powerful machine learning algorithms, not just because it creates very accurate models, but also because it is capable of handling huge datasets with millions of examples and features. It usually outperforms random forest in accuracy but, because of its sequential nature, can be significantly slower in training.
Sequences are among the most frequently observed types of structured data. We communicate using sequences of words and sentences, we execute tasks in sequences, and our genes, the music we listen to, the videos we watch, and our observations of a continuous process, such as a moving car or the price of a stock, are all sequential.
Sequence labeling is the problem of automatically assigning a label to each element of a sequence. A labeled sequential training example in sequence labeling is a pair of lists (X, Y), where X is a list of feature vectors, one per time step, and Y is a list of labels of the same length. For example, X could represent words in a sentence such as [“big”, “beautiful”, “car”], and Y would be the list of the corresponding parts of speech, such as [“adjective”, “adjective”, “noun”]. More formally, in an example i, X_i = [x_i^1, x_i^2, …, x_i^{size_i}], where size_i is the length of the sequence of the example i, Y_i = [y_i^1, y_i^2, …, y_i^{size_i}], and each label belongs to {1, 2, …, C}.
You have already seen that an RNN can be used to label a sequence. At each time step t, it reads an input feature vector x^t, and the last recurrent layer outputs a label y^t (a scalar in the case of binary labeling, or a vector of scores in the case of multiclass or multilabel labeling).
However, an RNN is not the only possible model for sequence labeling. The model called Conditional Random Fields (CRF) is a very effective alternative that often performs well in practice for feature vectors that have many informative features. For example, imagine we have the task of named entity extraction and we want to build a model that would label each word in a sentence such as “I go to San Francisco” with one of several classes, including location. If our feature vectors (which represent words) contain such binary features as “whether or not the word starts with a capital letter” and “whether or not the word can be found in the list of locations,” such features would be very informative and would help to classify the words San and Francisco as location.
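A toy sketch of such informative binary features (the word list and the helper name are invented for illustration):

```python
LOCATIONS = {"san", "francisco", "paris", "rome"}  # hypothetical list of locations

def word_features(word):
    # The two binary features mentioned above: starts-with-capital and
    # appears-in-a-list-of-locations.
    return [int(word[0].isupper()), int(word.lower() in LOCATIONS)]

sentence = ["I", "go", "to", "San", "Francisco"]
features = [word_features(w) for w in sentence]
```

Here the words San and Francisco get the feature vector [1, 1], which makes the location class easy to predict.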
Building handcrafted features is known to be a labor-intensive process that requires a significant level of domain expertise.
CRF is an interesting model and can be seen as a generalization of logistic regression to sequences. However, in practice, for sequence labeling tasks, it has been outperformed by bidirectional deep gated RNNs. CRFs are also significantly slower in training, which makes them difficult to apply to large training sets (with hundreds of thousands of examples). Additionally, a large training set is where a deep neural network thrives.
Sequence-to-sequence learning (often abbreviated as seq2seq learning) is a generalization of the sequence labeling problem. In seq2seq, X_i and Y_i can have different lengths. seq2seq models have found application in machine translation (where, for example, the input is an English sentence and the output is the corresponding French sentence), conversational interfaces (where the input is a question typed by the user and the output is the answer from the machine), text summarization, spelling correction, and many others.
Many but not all seq2seq learning problems are currently best solved by neural networks. The network architectures used in seq2seq all have two parts: an encoder and a decoder.
In seq2seq neural network learning, the encoder is a neural network that accepts sequential input. It can be an RNN, but also a CNN or some other architecture. The role of the encoder is to read the input and generate some sort of state (similar to the state in RNN) that can be seen as a numerical representation of the meaning of the input the machine can work with. The meaning of some entity, whether it be an image, a text or a video, is usually a vector or a matrix that contains real numbers. In machine learning jargon, this vector (or matrix) is called the embedding of the input.
The decoder is another neural network that takes an embedding as input and is capable of generating a sequence of outputs. As you could have already guessed, that embedding comes from the encoder. To produce a sequence of outputs, the decoder takes a start-of-sequence input feature vector x^0 (typically all zeroes), produces the first output y^1, updates its state by combining the embedding and the input x^0, and then uses the output y^1 as its next input x^1. For simplicity, the dimensionality of y^t can be the same as that of x^t; however, it is not strictly necessary. As we saw in Chapter 6, each layer of an RNN can produce many simultaneous outputs: one can be used to generate the label y^t, while another one, of a different dimensionality, can be used as the next input x^t.
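The decoder loop can be made concrete with a tiny numpy sketch. The weights below are random and untrained, and the sizes are made up; the point is only the mechanics of starting from the encoder's embedding, feeding in a start-of-sequence vector of zeros, and feeding each produced token back in as the next input:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 5, 4                      # toy vocabulary size and state size (made up)
W = rng.normal(size=(H, H + V))  # state-update weights (untrained)
U = rng.normal(size=(V, H))      # output projection (untrained)

def decode(embedding, steps=3):
    # Greedy decoding: each produced token becomes the next input.
    h, x, out = embedding, np.zeros(V), []
    for _ in range(steps):
        h = np.tanh(W @ np.concatenate([h, x]))
        token = int(np.argmax(U @ h))
        out.append(token)
        x = np.eye(V)[token]     # one-hot vector of the produced token
    return out

tokens = decode(np.ones(H))
```

A trained decoder would also emit a special end-of-sequence token to decide when to stop, rather than running for a fixed number of steps.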
Both encoder and decoder are trained simultaneously using the training data. The errors at the decoder output are propagated to the encoder via backpropagation.
A traditional seq2seq architecture is illustrated below:
More accurate predictions can be obtained using an architecture with attention. The attention mechanism is implemented by an additional set of parameters that combine some information from the encoder (in RNNs, this information is the list of state vectors of the last recurrent layer from all encoder time steps) with the current state of the decoder to generate the label. That allows for even better retention of long-term dependencies than provided by gated units and bidirectional RNNs.
A seq2seq architecture with attention is illustrated below:
seq2seq learning is a relatively new research domain. Novel network architectures are regularly discovered and published. Training such architectures can be challenging, as the number of hyperparameters to tune and other architectural decisions can be overwhelming. Consult the book’s wiki for state-of-the-art material, tutorials, and code samples.
Active learning is an interesting supervised learning paradigm. It is usually applied when obtaining labeled examples is costly. That is often the case in the medical or financial domains, where the opinion of an expert may be required to annotate patients’ or customers’ data. The idea is to start learning with relatively few labeled examples, and a large number of unlabeled ones, and then label only those examples that contribute the most to the model quality.
There are multiple strategies of active learning. Here, we discuss only the following two: 1) data density and uncertainty based, and 2) support vector-based.
The former strategy applies the current model f, trained using the existing labeled examples, to each of the remaining unlabeled examples (or, to save computing time, to some random sample of them). For each unlabeled example x, the following importance score is computed: density(x) · uncertainty_f(x). Density reflects how many examples surround x in its close neighborhood, while uncertainty_f(x) reflects how uncertain the prediction of the model f is for x. In binary classification with sigmoid, the closer the prediction score is to 0.5, the more uncertain is the prediction. In SVM, the closer the example is to the decision boundary, the more uncertain is the prediction.
In multiclass classification, entropy can be used as a typical measure of uncertainty:

\[ H_f(x) = -\sum_{c=1}^{C} \Pr\left(y^{(c)}; f(x)\right) \ln\left[\Pr\left(y^{(c)}; f(x)\right)\right], \]
where Pr(y^(c); f(x)) is the probability score the model f assigns to class y^(c) when classifying x. You can see that if Pr(y^(c); f(x)) = 1/C for each c, then the model is the most uncertain and the entropy is at its maximum of ln C; on the other hand, if for some c, Pr(y^(c); f(x)) = 1, then the model is certain about that class and the entropy is at its minimum of 0.
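The entropy measure above takes a few lines of Python (a sketch; it assumes the model's probability scores are already computed):

```python
import math

def entropy(probs):
    # Maximal (ln C) when all C classes are equally likely; 0 when the
    # model puts probability 1 on a single class.
    return -sum(p * math.log(p) for p in probs if p > 0)
```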
Density for the example x can be obtained by taking the average of the distance from x to each of its k nearest neighbors (with k being a hyperparameter).
Once we know the importance score of each unlabeled example, we pick the one with the highest importance score and ask the expert to annotate it. Then we add the new annotated example to the training set, rebuild the model and continue the process until some stopping criterion is satisfied. A stopping criterion can be chosen in advance (the maximum number of requests to the expert based on the available budget) or depend on how well our model performs according to some metric.
The support vector-based active learning strategy consists in building an SVM model using the labeled data. We then ask our expert to annotate the unlabeled example that lies the closest to the hyperplane that separates the two classes. The idea is that if the example lies closest to the hyperplane, then it is the least certain and would contribute the most to the reduction of possible places where the true (the one we look for) hyperplane could lie.
Some active learning strategies can incorporate the cost of asking an expert for a label. Others learn to ask the expert’s opinion. The “query by committee” strategy consists of training multiple models using different methods and then asking an expert to label the examples on which those models disagree the most. Some strategies try to select examples to label so that the variance or the bias of the model is reduced the most.
In semi-supervised learning (SSL), we also have labels for only a small fraction of the dataset; most of the remaining examples are unlabeled. Our goal is to leverage the large number of unlabeled examples to improve the model performance without asking for additional labeled examples.
Historically, there were multiple attempts at solving this problem. None of them could be called universally acclaimed and frequently used in practice. For example, one frequently cited SSL method is called self-learning. In self-learning, we use a learning algorithm to build the initial model using the labeled examples. Then we apply the model to all unlabeled examples and label them using the model. If the confidence score of the prediction for some unlabeled example is higher than some threshold (chosen experimentally), then we add this labeled example to our training set, retrain the model, and continue like this until a stopping criterion is satisfied. We could stop, for example, if the accuracy of the model has not improved during the last several iterations.
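The self-learning loop can be sketched as below. The "learning algorithm" here is a made-up nearest-centroid model, chosen only to keep the example runnable, and the confidence score (a softmax of negative distances) is likewise an invented stand-in, not something the text prescribes:

```python
import numpy as np

def fit_centroids(X, y):
    # Stand-in learner: one centroid per class.
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def self_train(X_lab, y_lab, X_unl, threshold=0.8, max_iter=10):
    # Self-learning: label the pool with the current model, absorb the
    # confident predictions into the training set, retrain, repeat.
    for _ in range(max_iter):
        if len(X_unl) == 0:
            break
        classes, centroids = fit_centroids(X_lab, y_lab)
        d = np.linalg.norm(X_unl[:, None, :] - centroids[None, :, :], axis=2)
        scores = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)
        keep = scores.max(axis=1) >= threshold
        if not keep.any():
            break
        X_lab = np.vstack([X_lab, X_unl[keep]])
        y_lab = np.concatenate([y_lab, classes[scores[keep].argmax(axis=1)]])
        X_unl = X_unl[~keep]
    return X_lab, y_lab
```

In a real application, the model, its confidence score, and the threshold would all come from your actual learner.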
The above method can bring some improvement to the model compared to just using the initially labeled dataset, but the increase in performance usually is not impressive. Furthermore, in practice, the quality of the model could even decrease. That depends on the properties of the statistical distribution the data was drawn from, which is usually unknown.
On the other hand, the recent advancements in neural network learning brought some impressive results. For example, it was shown that for some datasets, such as MNIST (a frequent testbench in computer vision that consists of labeled images of handwritten digits from 0 to 9) the model trained in a semi-supervised way has an almost perfect performance with just 10 labeled examples per class (100 labeled examples overall). For comparison, MNIST contains 70,000 labeled examples (60,000 for training and 10,000 for test). The neural network architecture that attained such a remarkable performance is called a ladder network. To understand ladder networks you have to understand what an autoencoder is.
An autoencoder is a feed-forward neural network with an encoder-decoder architecture. It is trained to reconstruct its input. So the training example is a pair (x, x). We want the output x̂ of the model to be as similar to the input x as possible.
An important detail here is that an autoencoder’s network looks like an hourglass, with a bottleneck layer in the middle that contains the embedding of the D-dimensional input vector; the embedding layer usually has many fewer units than D. The goal of the decoder is to reconstruct the input feature vector from this embedding. Theoretically, a small number of units in the bottleneck layer is sufficient to successfully encode MNIST images. In a typical autoencoder, schematically depicted in fig. 39, the cost function is usually either the mean squared error (when features can be any number) or the binary cross-entropy (when features are binary and the units of the last layer of the decoder have the sigmoid activation function). If the cost is the mean squared error, it is given by:

\[ \frac{1}{N}\sum_{i=1}^{N} \lVert x_i - \hat{x}_i \rVert^2, \]
where ‖x_i − x̂_i‖ is the Euclidean distance between the two vectors.
A denoising autoencoder corrupts the left-hand side x in the training example (x, x) by adding some random perturbation to the features. If our examples are grayscale images with pixels represented as values between 0 and 1, usually Gaussian noise is added to each feature. For each feature j of the input feature vector x, the noise value n^(j) is sampled from the Gaussian distribution:

\[ n^{(j)} \sim \mathcal{N}(\mu, \sigma), \]
where the notation ∼ means “sampled from,” and N(μ, σ) denotes the Gaussian distribution with mean μ and standard deviation σ, whose pdf is given by:

\[ f_{\theta}(z) \stackrel{\text{def}}{=} \frac{1}{\sigma\sqrt{2\pi}}\, \exp\left(-\frac{(z-\mu)^2}{2\sigma^2}\right), \quad \text{where } \theta \stackrel{\text{def}}{=} [\mu, \sigma]. \]
In the above equation, π is the constant and σ is a hyperparameter. The new, corrupted value of the feature x^(j) is given by x^(j) + n^(j).
A ladder network is a denoising autoencoder with an upgrade. The encoder and the decoder have the same number of layers. The bottleneck layer is used directly to predict the label (using the softmax activation function). The network has several cost functions. For each layer l of the encoder and the corresponding layer l of the decoder, one cost penalizes the difference between the outputs of the two layers (using the squared Euclidean distance). When a labeled example is used during training, another cost function, C_c, penalizes the error in the prediction of the label (the negative log-likelihood cost function is used). The combined cost function, C_c + Σ_l λ_l C_d^l (averaged over all examples in the batch), is optimized by minibatch stochastic gradient descent with backpropagation. The hyperparameters λ_l for each layer l determine the tradeoff between the classification and the encoding-decoding costs.
In the ladder network, not just the input is corrupted with noise, but also the output of each encoder layer (during training). When we apply the trained model to a new input x to predict its label, we do not corrupt the input.
Other semi-supervised learning techniques, not related to training neural networks, exist. One of them involves building the model using the labeled data and then clustering the unlabeled and labeled examples together using any clustering technique (we consider some of them in Chapter 9). For each new example, we then output as a prediction the majority label of the cluster it belongs to.
Another technique, called S3VM, is based on using SVM. We build one SVM model for each possible labeling of unlabeled examples and then we pick the model with the largest margin. The paper on S3VM describes an approach that allows solving this problem without actually enumerating all possible labelings.
This chapter would be incomplete without mentioning two other important supervised learning paradigms. One of them is one-shot learning. In one-shot learning, typically applied in face recognition, we want to build a model that can recognize that two photos of the same person represent the same person. If we present to the model two photos of two different people, we expect the model to recognize that the two people are different.
To solve such a problem, we could go a traditional way and build a binary classifier that takes two images as input and predicts either true (when the two pictures represent the same person) or false (when the two pictures belong to different people). However, in practice, this would result in a neural network twice as big as a typical neural network, because each of the two pictures needs its own embedding subnetwork. Training such a network would be challenging not only because of its size but also because the positive examples would be much harder to obtain than negative ones. So the problem is highly imbalanced.
One way to effectively solve the problem is to train a siamese neural network (SNN). An SNN can be implemented as any kind of neural network: a CNN, an RNN, or an MLP. The network only takes one image as input at a time, so the size of the network is not doubled. To obtain a binary classifier “same_person”/“not_same” out of a network that only takes one picture as input, we train the network in a special way.
To train an SNN, we use the triplet loss function. For example, let us have three images of a face: image A (for anchor), image P (for positive), and image N (for negative). A and P are two different pictures of the same person; N is a picture of another person. Each training example i is now a triplet (A_i, P_i, N_i).
Let’s say we have a neural network model f that can take a picture of a face as input and output an embedding of this picture. The triplet loss for example i is defined as,

\[ \max\left(\lVert f(A_i) - f(P_i)\rVert^2 - \lVert f(A_i) - f(N_i)\rVert^2 + \alpha,\; 0\right). \qquad (23) \]
The cost function is defined as the average triplet loss:

\[ \frac{1}{N}\sum_{i=1}^{N} \max\left(\lVert f(A_i) - f(P_i)\rVert^2 - \lVert f(A_i) - f(N_i)\rVert^2 + \alpha,\; 0\right), \]
where α is a positive hyperparameter. Intuitively, ‖f(A_i) − f(P_i)‖² is low when our neural network outputs similar embedding vectors for A_i and P_i; ‖f(A_i) − f(N_i)‖² is high when the embeddings for the pictures of two different people are different. If our model works the way we want, then the term ‖f(A_i) − f(P_i)‖² − ‖f(A_i) − f(N_i)‖² will always be negative, because we subtract a high value from a small value. By setting α higher, we force the term to be even smaller, to make sure that the model learns to recognize the two same faces and two different faces with a high margin. If it is not small enough, then because of α the cost will be positive, and the model parameters will be adjusted in backpropagation.
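The triplet loss itself is a direct transcription of the formula above; in this sketch, f_a, f_p, and f_n are assumed to be precomputed embedding vectors:

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    # max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0)
    return float(max(np.sum((f_a - f_p) ** 2) - np.sum((f_a - f_n) ** 2) + alpha, 0.0))
```

When the anchor is much closer to the positive than to the negative, the loss is exactly zero and contributes nothing to the gradient.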
Rather than randomly choosing an image for N, a better way to create triplets for training is to use the current model after several epochs of learning and find candidates for N that are similar to A and P according to that model. Using random examples as N would significantly slow down the training, because the neural network would easily see the difference between the pictures of two random people, so the average triplet loss would be low most of the time and the parameters would not be updated fast enough.
To build an SNN, we first decide on the architecture of our neural network. For example, a CNN is a typical choice if our inputs are images. Given an example, to calculate the average triplet loss, we apply, consecutively, the model to A_i, then to P_i, then to N_i, and then we compute the loss for that example using eq. 23. We repeat that for all triplets in the batch and then compute the cost; gradient descent with backpropagation propagates the cost through the network to update its parameters.
It’s a common misconception that for one-shot learning we need only one example of each entity for training. In practice, we need more than one example of each person for the person identification model to be accurate. It’s called one-shot because of the most frequent application of such a model: face-based authentication. For example, such a model could be used to unlock your phone. If your model is good, then you only need to have one picture of yourself on your phone, and it will recognize you; it will also recognize that someone else is not you. When we have the model, to decide whether two pictures A and Â belong to the same person, we check whether ||f(A) - f(Â)||^2 is less than τ, a hyperparameter.
I finish this chapter with zero-shot learning. It is a relatively new research area, so no algorithm has yet proven to have significant practical utility. Therefore, I only outline the basic idea here and leave the details of the various algorithms for further reading. In zero-shot learning (ZSL) we want to train a model to assign labels to objects. The most frequent application is learning to assign labels to images.
However, contrary to standard classification, we want the model to be able to predict labels that we didn’t have in the training data. How is that possible?
The trick is to use embeddings not just to represent the input x but also to represent the output y. Imagine that we have a model that for any word in English can generate an embedding vector with the following property: if a word y_i has a similar meaning to the word y_k, then the embedding vectors for these two words will be similar. For example, if y_i is Paris and y_k is Rome, then they will have embeddings that are similar; on the other hand, if y_k is potato, then the embeddings of y_i and y_k will be dissimilar. Such embedding vectors are called word embeddings, and they are usually compared using a cosine similarity metric.
Word embeddings have the property that each dimension of the embedding represents a specific feature of the meaning of the word. For example, if our word embedding has four dimensions (usually they are much wider, between 50 and 300 dimensions), then these four dimensions could represent such features of the meaning as animalness, abstractness, sourness, and yellowness (yes, it sounds funny, but it’s just an example). So the word bee would have an embedding like [1, 0, 0, 1], the word yellow like [0, 1, 0, 1], and the word unicorn like [1, 1, 0, 0]. The values for each embedding are obtained using a specific training procedure applied to a vast text corpus.
Now, in our classification problem, we can replace the label y_i for each example x_i in our training set with its word embedding and train a multi-label model that predicts word embeddings. To get the label for a new example x, we apply our model f to x, get the embedding ŷ, and then search among all English words for those whose embeddings are the most similar to ŷ using cosine similarity.
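The label-retrieval step is a nearest-neighbor search in embedding space. A minimal sketch using the toy four-dimensional embeddings from the previous paragraph (the dictionary and the predicted vector are made up for the example):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def closest_word(predicted_embedding, dictionary):
    """Return the word whose embedding is most similar to the prediction."""
    return max(dictionary,
               key=lambda w: cosine_similarity(predicted_embedding, dictionary[w]))

# Toy dimensions: animalness, abstractness, sourness, yellowness.
dictionary = {
    "bee": np.array([1.0, 0.0, 0.0, 1.0]),
    "yellow": np.array([0.0, 1.0, 0.0, 1.0]),
    "unicorn": np.array([1.0, 1.0, 0.0, 0.0]),
}
predicted = np.array([0.9, 0.1, 0.0, 0.8])  # hypothetical model output for an image
label = closest_word(predicted, dictionary)
```

In a real system, the dictionary would contain tens of thousands of pretrained word embeddings, and the search would use an approximate nearest-neighbor index.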
Why does that work? Take a zebra, for example. It is white, it is a mammal, and it has stripes. Take a clownfish: it is orange, not a mammal, and has stripes. Now take a tiger: it is orange, it has stripes, and it is a mammal. If these three features are present in word embeddings, the CNN would learn to detect these same features in pictures. Even if the label tiger was absent in the training data, as long as other objects, including zebras and clownfish, were present, the CNN will most likely learn the notions of mammalness, orangeness, and stripedness to predict the labels of those objects. Once we present the picture of a tiger to the model, those features will be correctly identified from the image, and most likely the closest word embedding in our English dictionary to the predicted embedding will be that of tiger.
This chapter contains the description of techniques that you could find useful in your practice in some contexts. It’s called “Advanced Practice” not because the presented techniques are more complex, but rather because they are applied in some very specific contexts. In many practical situations, you will most likely not need to resort to using these techniques, but sometimes they are very helpful.
Often in practice, examples of some class will be underrepresented in your training data. This is the case, for example, when your classifier has to distinguish between genuine and fraudulent e-commerce transactions: the examples of genuine transactions are much more frequent. If you use SVM with soft margin, you can define a cost for misclassified examples. Because noise is always present in the training data, there are high chances that many examples of genuine transactions would end up on the wrong side of the decision boundary by contributing to the cost.
The SVM algorithm tries to move the hyperplane to avoid misclassified examples as much as possible. The “fraudulent” examples, which are in the minority, risk being misclassified in order to classify more numerous examples of the majority class correctly. This situation is illustrated below:
This problem is observed for most learning algorithms applied to imbalanced datasets.
If you set the cost of misclassification of examples of the minority class higher, then the model will try harder to avoid misclassifying those examples, but this will incur the cost of misclassification of some examples of the majority class, as illustrated below:
Some SVM implementations allow you to provide weights for every class. The learning algorithm takes this information into account when looking for the best hyperplane.
If a learning algorithm doesn’t allow weighting classes, you can try the technique of oversampling. It consists of increasing the importance of examples of some class by making multiple copies of the examples of that class.
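Oversampling by duplication can be sketched as follows; the function name and the factor parameter are illustrative, not from the text:

```python
def oversample(examples, labels, minority_label, factor):
    """Append (factor - 1) extra copies of every minority-class example."""
    new_x, new_y = list(examples), list(labels)
    for x, y in zip(examples, labels):
        if y == minority_label:
            new_x.extend([x] * (factor - 1))
            new_y.extend([y] * (factor - 1))
    return new_x, new_y

# Two genuine transactions (label 0) and one fraudulent one (label 1).
xs, ys = oversample([[52.0], [9.9], [780.0]], [0, 0, 1],
                    minority_label=1, factor=3)
```

After oversampling with factor 3, the single fraudulent example appears three times, so it weighs more in the cost.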
An opposite approach, undersampling, is to randomly remove from the training set some examples of the majority class.
You might also try to create synthetic examples by randomly sampling feature values of several examples of the minority class and combining them to obtain a new example of that class. There are two popular algorithms that oversample the minority class by creating synthetic examples: the synthetic minority oversampling technique (SMOTE) and the adaptive synthetic sampling method (ADASYN).
SMOTE and ADASYN work similarly in many ways. For a given example x_i of the minority class, they pick the k nearest neighbors of this example (let’s denote this set of examples S_k) and then create a synthetic example x_new as x_i + λ(x_zi - x_i), where x_zi is an example of the minority class chosen randomly from S_k. The interpolation hyperparameter λ is a random number in the range [0, 1].
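The interpolation step shared by both algorithms can be sketched like this (a single synthetic example; the real algorithms also handle the neighbor search and decide how many examples to generate per x_i):

```python
import random
import numpy as np

def synthetic_example(x_i, neighbors):
    """Create x_new = x_i + lambda * (x_zi - x_i) from a random neighbor."""
    x_zi = random.choice(neighbors)  # random minority-class neighbor from S_k
    lam = random.random()            # interpolation factor, uniform in [0, 1)
    return x_i + lam * (x_zi - x_i)

x_i = np.array([0.0, 0.0])
neighbors = [np.array([1.0, 1.0])]   # a one-element S_k, for illustration
x_new = synthetic_example(x_i, neighbors)
```

The synthetic example always lies on the segment between x_i and the chosen neighbor, so it stays inside the region occupied by the minority class.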
Both SMOTE and ADASYN randomly pick all possible x_i in the dataset. In ADASYN, the number of synthetic examples generated for each x_i is proportional to the number of examples in S_k which are not from the minority class. Therefore, more synthetic examples are generated in the areas where the examples of the minority class are rare.
Some algorithms are less sensitive to the problem of an imbalanced dataset. Decision trees, as well as random forest and gradient boosting, often perform well on imbalanced datasets.
Ensemble algorithms, like Random Forest, typically combine models of the same nature. They boost performance by combining hundreds of weak models. In practice, we can sometimes get an additional performance gain by combining strong models made with different learning algorithms. In this case, we usually use only two or three models.
Three typical ways to combine models are 1) averaging, 2) majority vote and 3) stacking.
Averaging works for regression as well as for those classification models that return classification scores. You simply apply all your models (let’s call them base models) to the input x and then average the predictions. To see whether the averaged model works better than each individual algorithm, you test it on the validation set using a metric of your choice.
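A minimal sketch of averaging, with two hypothetical base models that return class-score vectors:

```python
import numpy as np

def average_predictions(base_models, x):
    """Apply every base model to x and average their score vectors."""
    return np.mean([model(x) for model in base_models], axis=0)

model_1 = lambda x: np.array([0.2, 0.8])   # stand-in base model scores
model_2 = lambda x: np.array([0.6, 0.4])
averaged = average_predictions([model_1, model_2], x=None)
```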
Majority vote works for classification models. You apply all your base models to the input x and then return the majority class among all predictions. In the case of a tie, you either randomly pick one of the classes or return an error message (if the fact of misclassifying would incur a significant cost).
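A sketch of majority voting with both tie-breaking strategies mentioned above; the function name and parameter are illustrative:

```python
import random
from collections import Counter

def majority_vote(predictions, break_ties_randomly=True):
    """Return the majority class; on a tie, pick randomly or raise an error."""
    counts = Counter(predictions).most_common()
    best_count = counts[0][1]
    tied = [cls for cls, n in counts if n == best_count]
    if len(tied) == 1:
        return tied[0]
    if break_ties_randomly:
        return random.choice(tied)
    raise ValueError("tie between classes: %r" % tied)
```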
Stacking consists of building a meta-model that takes the outputs of base models as input. Let’s say you want to combine classifiers f1 and f2, both predicting the same set of classes. To create a training example (x̂_i, ŷ_i) for the stacked model, set x̂_i = [f1(x_i), f2(x_i)] and ŷ_i = y_i.
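Building one training example for the stacked meta-model can be sketched like this (the base classifiers here are stand-ins that return a fixed class):

```python
def stacked_example(base_models, x, y):
    """Meta-features are the base models' predictions; the target stays y."""
    x_hat = [f(x) for f in base_models]
    return x_hat, y

f1 = lambda x: 0    # hypothetical base classifier predictions
f2 = lambda x: 1
x_hat, y_hat = stacked_example([f1, f2], x="some input", y=1)
```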
If some of your base models return not just a class, but also a score for each class, you can use these values as features too.
To train the stacked model, it is recommended to use examples from the training set and tune the hyperparameters of the stacked model using cross-validation.
Obviously, you have to make sure that your stacked model performs better on the validation set than each of the base models you stacked.
The reason that combining multiple models can bring better performance is that when several uncorrelated strong models agree they are more likely to agree on the correct outcome. The keyword here is “uncorrelated.” Ideally, base models should be obtained using different features or using algorithms of a different nature — for example, SVMs and Random Forest. Combining different versions of the decision tree learning algorithm, or several SVMs with different hyperparameters, may not result in a significant performance boost.
In neural network training, one challenging aspect is how to convert your data into an input the network can work with. If your input is images, first of all, you have to resize all images so that they have the same dimensions. After that, pixels are usually first standardized and then normalized to the range [0, 1].
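A sketch of that preprocessing for a single image; real pipelines usually standardize with per-channel statistics computed over the whole training set rather than per image:

```python
import numpy as np

def preprocess_pixels(img):
    """Standardize pixel values, then rescale the result to [0, 1]."""
    standardized = (img - img.mean()) / img.std()
    lo, hi = standardized.min(), standardized.max()
    return (standardized - lo) / (hi - lo)

img = np.array([[0.0, 128.0], [255.0, 64.0]])  # a toy 2x2 grayscale image
out = preprocess_pixels(img)
```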
Texts have to be tokenized (that is, split into pieces, such as words, punctuation marks, and other symbols). For CNN and RNN, each token is converted into a vector using the one-hot encoding, so the text becomes a list of one-hot vectors. Another, often better way to represent tokens is by using word embeddings. For a multilayer perceptron, to convert texts to vectors the bag of words approach may work well, especially for larger texts (larger than SMS messages and tweets).
The choice of a specific neural network architecture is a difficult one. For the same problem, like seq2seq learning, there is a variety of architectures, and new ones are proposed almost every year. I recommend researching state-of-the-art solutions for your problem using Google Scholar or the Microsoft Academic search engine, which allow searching for scientific publications by keywords and time range. If you don’t mind working with a less modern architecture, I recommend looking for implemented architectures on GitHub and finding one that could be applied to your data with minor modifications.
In practice, the advantage of a modern architecture over an older one becomes less significant as you preprocess, clean and normalize your data, and create a larger training set. Modern neural network architectures are a result of the collaboration of scientists from several labs and companies; such models could be very complex to implement on your own and usually require much computational power to train. Time spent trying to replicate results from a recent scientific paper may not be worth it. This time could better be spent on building the solution around a less modern but stable model and getting more training data.
Once you have decided on the architecture of your network, you have to decide on the number of layers, their type, and their size. It is recommended to start with one or two layers, train a model, and see whether it fits the training data well (has a low bias). If not, gradually increase the size of each layer and the number of layers until the model perfectly fits the training data. Once this is the case, if the model doesn’t perform well on the validation data (has a high variance), you should add regularization to your model. If, after adding regularization, the model doesn’t fit the training data anymore, slightly increase the size of the network. Continue iterating until the model fits both training and validation data well enough according to your metric.
In neural networks, besides L1 and L2 regularization, you can use neural network specific regularizers: dropout, early stopping, and batch normalization. The latter is technically not a regularization technique, but it often has a regularization effect on the model.
The concept of dropout is very simple. Each time you run a training example through the network, you temporarily exclude some units from the computation at random. The higher the percentage of units excluded, the higher the regularization effect. Neural network libraries allow you to add a dropout layer between two successive layers, or you can specify the dropout parameter for a layer. The dropout parameter is in the range [0, 1] and has to be found experimentally by tuning it on the validation data.
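The mechanism can be sketched directly in numpy. This sketch uses "inverted dropout," where surviving activations are scaled by 1/(1 - rate) so their expected value is unchanged; that is how common libraries implement it, but the function here is illustrative:

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero out a random fraction `rate` of units; rescale the survivors."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(1000)
dropped = dropout(activations, rate=0.5, rng=rng)
```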
Early stopping is a way to train a neural network by saving a preliminary model after every epoch and assessing its performance on the validation set. As you remember from the section about gradient descent in Chapter 4, the cost decreases as the number of epochs increases. A decreased cost means that the model fits the training data well. However, at some point, after some epoch e, the model can start overfitting: the cost keeps decreasing, but the performance of the model on the validation data deteriorates. If you keep the version of the model after each epoch in a file, you can stop the training once you start observing decreased performance on the validation set. Alternatively, you can keep running the training process for a fixed number of epochs and then, in the end, pick the best model. Models saved after each epoch are called checkpoints. Some machine learning practitioners rely on this technique very often; others try to properly regularize the model to avoid such undesirable behavior.
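The checkpoint-and-stop loop can be sketched as below. The train_epoch and validate callables are placeholders for whatever framework you use; the patience parameter (how many non-improving epochs to tolerate before stopping) is a common extra knob, not from the text:

```python
def train_with_early_stopping(train_epoch, validate, max_epochs, patience):
    """Keep the best checkpoint; stop when validation stops improving."""
    best_model, best_score, bad_epochs = None, float("-inf"), 0
    for _ in range(max_epochs):
        model = train_epoch()      # one epoch of training; returns a checkpoint
        score = validate(model)    # performance on the validation set
        if score > best_score:
            best_model, best_score, bad_epochs = model, score, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break              # validation performance keeps degrading
    return best_model, best_score

# Toy simulation: the validation score peaks at epoch 2, then degrades.
scores = [0.5, 0.7, 0.9, 0.85, 0.8, 0.75]
epochs = iter(range(len(scores)))
best_model, best_score = train_with_early_stopping(
    train_epoch=lambda: next(epochs),   # the "model" is just the epoch index
    validate=lambda m: scores[m],
    max_epochs=10,
    patience=2,
)
```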
Batch normalization (which rather has to be called batch standardization) is a technique that consists of standardizing the outputs of each layer before the units of the subsequent layer receive them as input. In practice, batch normalization results in faster and more stable training, as well as some regularization effect. So it’s always a good idea to try to use batch normalization. In neural network libraries, you can often insert a batch normalization layer between two layers.
Another regularization technique that can be applied not just to neural networks, but to virtually any learning algorithm, is called data augmentation. This technique is often used to regularize models that work with images. Once you have your original labeled training set, you can create a synthetic example from an original example by applying various transformations to the original image: zooming it slightly, rotating, flipping, darkening, and so on. You keep the original label in these synthetic examples. In practice, this often results in increased performance of the model.
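A minimal sketch using flips only; zooming, rotating and darkening work the same way, and the label is always copied unchanged:

```python
import numpy as np

def augment(image, label):
    """Create synthetic labeled examples from one original labeled image."""
    return [
        (image, label),
        (np.fliplr(image), label),  # horizontal flip
        (np.flipud(image), label),  # vertical flip
    ]

examples = augment(np.array([[1, 2], [3, 4]]), label="cat")
```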
Often in practice, you will work with multimodal data. For example, your input could be an image and text and the binary output could indicate whether the text describes this image.
It’s hard to adapt shallow learning algorithms to work with multimodal data. However, it’s not impossible. You could train one shallow model on the image and another one on the text. Then you can use a model combination technique we discussed above.
If you cannot divide your problem into two independent subproblems, you can try to vectorize each input (by applying the corresponding feature engineering method) and then simply concatenate the two feature vectors to form one wider feature vector. For example, if your image has features [i(1), i(2), i(3)] and your text has features [t(1), t(2), t(3), t(4)], your concatenated feature vector will be [i(1), i(2), i(3), t(1), t(2), t(3), t(4)].
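In numpy, the concatenation is one call (the feature values below are made up for illustration):

```python
import numpy as np

image_features = np.array([0.1, 0.7, 0.2])        # [i(1), i(2), i(3)]
text_features = np.array([1.0, 0.0, 0.0, 1.0])    # [t(1), t(2), t(3), t(4)]
combined = np.concatenate([image_features, text_features])
```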
With neural networks, you have more flexibility. You can build two subnetworks, one for each type of input. For example, a CNN subnetwork would read the image while an RNN subnetwork would read the text. Both subnetworks have as their last layer an embedding: CNN has an embedding of the image, while RNN has an embedding of the text. You can now concatenate two embeddings and then add a classification layer, such as softmax or sigmoid, on top of the concatenated embeddings. Neural network libraries provide simple-to-use tools that allow concatenating or averaging of layers from several subnetworks.
In some problems, you would like to predict multiple outputs for one input. We considered multi-label classification in the previous chapter. Some problems with multiple outputs can be effectively converted into a multi-label classification problem, especially those whose labels are of the same nature (like tags) or for which fake labels can be created as a full enumeration of combinations of the original labels.
However, in some cases the outputs are multimodal, and their combinations cannot be effectively enumerated. Consider the following example: you want to build a model that detects an object on an image and returns its coordinates. In addition, the model has to return a tag describing the object, such as “person,” “cat,” or “hamster.” Your training example will be a feature vector that represents an image. The label will be represented as a vector of coordinates of the object and another vector with a one-hot encoded tag.
To handle a situation like that, you can create one subnetwork that works as an encoder. It will read the input image using, for example, one or several convolution layers. The encoder’s last layer will be the embedding of the image. Then you add two other subnetworks on top of the embedding layer: one takes the embedding vector as input and predicts the coordinates of an object. This first subnetwork can have a ReLU as the last layer, which is a good choice for predicting positive real numbers, such as coordinates; this subnetwork could use the mean squared error cost C1. The second subnetwork takes the same embedding vector as input and predicts the probabilities for each label. This second subnetwork can have a softmax as the last layer, which is appropriate for a probabilistic output, and use the averaged negative log-likelihood cost C2 (also called cross-entropy cost).
Obviously, you are interested in both accurately predicted coordinates and accurately predicted labels. However, it is impossible to optimize the two cost functions at the same time: by trying to optimize one, you risk hurting the second one, and the other way around. What you can do is add another hyperparameter γ in the range (0, 1) and define the combined cost function as γC1 + (1 - γ)C2. Then you tune the value of γ on the validation data just like any other hyperparameter.
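The combined cost is one line; the assertion below just documents the range of γ. The function name is illustrative:

```python
def combined_cost(c1, c2, gamma):
    """Weighted combination gamma*C1 + (1 - gamma)*C2 of the two costs."""
    assert 0.0 < gamma < 1.0, "gamma is a hyperparameter tuned on validation data"
    return gamma * c1 + (1.0 - gamma) * c2
```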
Transfer learning is probably where neural networks have a unique advantage over the shallow models. In transfer learning, you pick an existing model trained on some dataset, and you adapt this model to predict examples from another dataset, different from the one the model was built on. This second dataset is not like holdout sets you use for validation and test. It may represent some other phenomenon, or, as machine learning scientists say, it may come from another statistical distribution.
For example, imagine you have trained your model to recognize (and label) wild animals on a big labeled dataset. After some time, you have another problem to solve: you need to build a model that would recognize domestic animals. With shallow learning algorithms, you do not have many options: you have to build another big labeled dataset, now for domestic animals.
With neural networks, the situation is much more favorable. Transfer learning in neural networks works like this:

1. You build a deep model on the original big dataset (wild animals).
2. You compile a much smaller labeled dataset for your second model (domestic animals).
3. You remove the last one or several layers from the first model.
4. You replace the removed layers with new layers adapted for your new problem.
5. You “freeze” the parameters of the layers remaining from the first model.
6. You use your smaller labeled dataset and gradient descent to train the parameters of only the new layers.
Usually, there is an abundance of deep models for visual problems available online. You can find one that has high chances to be of use for your problem, download that model, remove several last layers (the quantity of layers to remove is a hyperparameter), add your own prediction layers and train your model.
Even if you don’t have an existing model, transfer learning can still help you in situations when your problem requires a labeled dataset that is very costly to obtain, but you can get another dataset for which labels are more readily available. Let’s say you build a document classification model. You got the taxonomy of labels from your employer, and it contains a thousand categories. In this case, you would need to pay someone to a) read, understand and memorize the differences between categories and b) read up to a million documents and annotate them.
To save on labeling so many examples, you could consider using Wikipedia pages as the dataset to build your first model. The labels for a Wikipedia page can be obtained automatically by taking the category the Wikipedia page belongs to. Once your first model has learned to predict Wikipedia categories, you can “fine tune” this model to predict the categories of your employer’s taxonomy. You will need much fewer annotated examples for your employer’s problem than you would need if you started solving your original problem from scratch.
Not all algorithms capable of solving a problem are practical. Some can be too slow. Some problems can be solved by a fast algorithm; for others, no fast algorithms can exist.
The subfield of computer science called analysis of algorithms is concerned with determining and comparing the complexity of algorithms. Big O notation is used to classify algorithms according to how their running time or space requirements grow as the input size grows.
For example, let’s say we have the problem of finding the two most distant one-dimensional examples in a set of examples S of size N. One algorithm we could craft to solve this problem would look like this (here and below, in Python):
def find_max_distance(S):
    result = None
    max_distance = 0
    for x1 in S:
        for x2 in S:
            if abs(x1 - x2) >= max_distance:
                max_distance = abs(x1 - x2)
                result = (x1, x2)
    return result
In the above algorithm, we loop over all values in S, and at every iteration of the first loop, we loop over all values in S once again. Therefore, the above algorithm makes N² comparisons of numbers. If we take as a unit of time the time the comparison, abs, and assignment operations take, then the time complexity (or, simply, complexity) of this algorithm is at most 5N². (At each iteration, we have one comparison, two abs, and two assignment operations.) When the complexity of an algorithm is measured in the worst case, big O notation is used. For the above algorithm, using big O notation, we write that the algorithm’s complexity is O(N²); constants, like 5, are ignored.
For the same problem, we can craft another algorithm like this:
def find_max_distance(S):
    result = None
    min_x = float("inf")
    max_x = float("-inf")
    for x in S:
        if x < min_x:
            min_x = x
        if x > max_x:
            max_x = x
    result = (max_x, min_x)
    return result
In the above algorithm, we loop over all values in S only once, so the algorithm’s complexity is O(N). In this case, we say that the latter algorithm is more efficient than the former.
An algorithm is called efficient when its complexity is polynomial in the size of the input. Therefore both O(N²) and O(N) are efficient, because N² is a polynomial of degree 2 and N is a polynomial of degree 1. However, for very large inputs, an O(N²) algorithm can still be slow. In the big data era, scientists often look for O(log N) algorithms.
From a practical standpoint, when you implement your algorithm, you should avoid using loops whenever possible. For example, you should use operations on matrices and vectors instead of loops: in Python, to compute the product of two vectors w and x, you should use a vectorized numpy operation rather than an explicit loop over their elements.
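As a sketch of the two approaches for the dot product of two vectors, assuming numpy is available:

```python
import numpy as np

def dot_vectorized(w, x):
    """One call into numpy's compiled implementation."""
    return np.dot(w, x)

def dot_looped(w, x):
    """Pure-Python loop: same result, much slower on large vectors."""
    wx = 0.0
    for i in range(len(w)):
        wx += w[i] * x[i]
    return wx

w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0])
```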
Use appropriate data structures. If the order of the elements in a collection doesn’t matter, use a set instead of a list. In Python, the operation of verifying whether a specific example x belongs to S is efficient when S is declared as a set and inefficient when S is declared as a list.
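The difference is easy to measure with the standard timeit module; the element searched for is placed at the end of the list to show the worst case:

```python
import timeit

N = 100_000
as_list = list(range(N))
as_set = set(as_list)

# The list scans up to N elements; the set does a single hash lookup.
t_list = timeit.timeit(lambda: (N - 1) in as_list, number=10)
t_set = timeit.timeit(lambda: (N - 1) in as_set, number=10)
```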
Another important data structure that you can use to make your Python code more efficient is dict. It is called a dictionary or a hashmap in other languages. It allows you to define a collection of key-value pairs with very fast lookups of keys.
Unless you know exactly what you do, always prefer using popular libraries to writing your own scientific code. Scientific Python packages like numpy, scipy, and scikit-learn were built by experienced scientists and engineers with efficiency in mind. They have many methods implemented in the C programming language for maximum efficiency.
If you need to iterate over a vast collection of elements, use generators that create a function that returns one element at a time rather than all the elements at once.
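A sketch: the generator below is created instantly and computes one value per request, even though it notionally represents a trillion squares:

```python
def squares(n):
    """Yield squares one at a time instead of building the whole list."""
    for i in range(n):
        yield i * i

gen = squares(10**12)                        # nothing is computed yet
first_three = [next(gen) for _ in range(3)]  # only three values materialize
```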
Use the cProfile package in Python to find inefficiencies in your code.
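A minimal cProfile sketch (the profiled function is a made-up example):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately loop-based so it shows up in the profile.
    total = 0
    for i in range(n):
        total += i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Print the five most expensive calls into a string report.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
assert "slow_sum" in report  # the report names the functions that cost time
```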
Finally, when nothing can be improved in your code from the algorithmic perspective, you can further boost the speed of your code by using:
the multiprocessing package to run computations in parallel, and
PyPy, Numba, or similar tools to compile your Python code into fast, optimized machine code.
Unsupervised learning deals with problems in which data doesn't have labels. That property makes it very problematic for many applications. The absence of labels representing the desired behavior for your model means the absence of a solid reference point to judge the quality of your model. In this book, I only present unsupervised learning methods that allow building models that can be evaluated based on data, as opposed to human judgment.
Density estimation is the problem of modeling the probability density function (pdf) of the unknown probability distribution from which the dataset has been drawn. It can be useful for many applications, in particular for novelty or intrusion detection. In Chapter 7, we already estimated the pdf to solve the one-class classification problem. To do that, we decided that our model would be parametric, more precisely a multivariate normal distribution (MVN). This decision was somewhat arbitrary: if the real distribution from which our dataset was drawn is different from the MVN, our model will very likely be far from perfect. We also know that models can be nonparametric. We used a nonparametric model in kernel regression. It turns out that the same approach can work for density estimation.
Let {x_i}_{i=1}^N be a one-dimensional dataset (the multi-dimensional case is similar) whose examples were drawn from a distribution with an unknown pdf f, with x_i ∈ ℝ for all i = 1, …, N. We are interested in modeling the shape of f. Our kernel model of f, denoted f̂_b, is given by,

f̂_b(x) = (1/(Nb)) Σ_{i=1}^{N} k((x − x_i)/b),   (24)
where b is a hyperparameter that controls the tradeoff between bias and variance of our model, and k is a kernel. Again, as in Chapter 7, we use a Gaussian kernel:

k(z) = (1/√(2π)) exp(−z²/2).
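The kernel model f̂_b above can be sketched in a few lines (a minimal illustration assuming numpy; the function names and the synthetic data are mine, not the book's):

```python
import numpy as np

def gaussian_kernel(z):
    # k(z) = exp(-z^2 / 2) / sqrt(2 * pi)
    return np.exp(-z ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

def kde(x, data, b):
    # f_hat_b(x) = (1 / (N * b)) * sum_i k((x - x_i) / b)
    return gaussian_kernel((x - data) / b).sum() / (len(data) * b)

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=500)  # drawn from a standard normal

# A kernel density estimate is itself a pdf: it is nonnegative
# and integrates to one (checked here with a Riemann sum).
xs = np.linspace(-5.0, 5.0, 1001)
vals = np.array([kde(x, data, b=0.3) for x in xs])
assert np.all(vals >= 0.0)
integral = vals.sum() * (xs[1] - xs[0])
assert abs(integral - 1.0) < 0.01
```

Varying b changes the smoothness of the estimate, which is exactly the bias-variance tradeoff discussed above.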
We look for the value of b that minimizes the difference between the real shape of f and the shape of our model f̂_b. A reasonable choice of measure of this difference is called the mean integrated squared error (MISE):

MISE(b) = E[ ∫_ℝ ( f̂_b(x) − f(x) )² dx ].   (25)
Intuitively, you see in eq. 25 that we square the difference between the real pdf f and our model of it, f̂_b. The integral ∫_ℝ replaces the summation Σ_{i=1}^N we employed in the mean squared error, while the expectation operator E replaces the average 1/N.
Indeed, when our loss is a function with a continuous domain, such as ( f̂_b(x) − f(x) )², we have to replace the summation with the integral. The expectation operation E means that we want b to be optimal for all possible realizations of our training set {x_i}_{i=1}^N. That is important because f̂_b is defined on a finite sample of some probability distribution, while the real pdf f is defined on an infinite domain (the set ℝ).
Now, we can rewrite the right-hand side term in eq. 25 like this:

E[ ∫ f̂_b²(x) dx ] − 2 E[ ∫ f̂_b(x) f(x) dx ] + E[ ∫ f²(x) dx ].
The third term in the above summation is independent of b and thus can be ignored. An unbiased estimator of the first term is given by ∫ f̂_b²(x) dx, while the unbiased estimator of the second term can be approximated by the cross-validation quantity (1/N) Σ_{i=1}^N f̂_b^{(i)}(x_i), where f̂_b^{(i)} is a kernel model of f computed on our training set with the example x_i excluded.
The term (1/N) Σ_{i=1}^N f̂_b^{(i)}(x_i) is known in statistics as the leave-one-out estimate, a form of cross-validation in which each fold consists of one example. You could have noticed that the term ∫ f̂_b(x) f(x) dx (let's call it a) is the expected value of the function f̂_b, because f is a pdf. It can be demonstrated that the leave-one-out estimate is an unbiased estimator of a.
Now, to find the optimal value b* for b, we minimize the cost defined as,

∫ f̂_b²(x) dx − (2/N) Σ_{i=1}^N f̂_b^{(i)}(x_i).
We can find b* using grid search. For D-dimensional feature vectors x, the error term x − x_i in eq. 24 can be replaced by the Euclidean distance ‖x − x_i‖. In fig. 42–fig. 44 you can see the estimates for the same pdf obtained with three different values of b.
The corresponding grid search curve is shown below:
We pick the value of b* at the minimum of the grid search curve.
Clustering is the problem of learning to assign a label to examples by leveraging an unlabeled dataset. Because the dataset is completely unlabeled, deciding whether the learned model is optimal is much more complicated than in supervised learning.
There is a variety of clustering algorithms, and, unfortunately, it's hard to tell which one will produce better clusters for your dataset. Usually, the performance of each algorithm depends on the unknown properties of the probability distribution the dataset was drawn from. In this chapter, I outline the most useful and widely used clustering algorithms.
The k-means clustering algorithm works as follows. First, you choose k, the number of clusters. Then you randomly put k feature vectors, called centroids, into the feature space.
We then compute the distance from each example x to each centroid c using some metric, like the Euclidean distance. Then we assign the closest centroid to each example (as if we labeled each example with the centroid id as the label). For each centroid, we calculate the average feature vector of the examples labeled with it. These average feature vectors become the new locations of the centroids.
We recompute the distance from each example to each centroid, modify the assignments, and repeat the procedure until the assignments don't change after the centroid locations are recomputed. The model is the list of assignments of centroid ids to the examples.
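The procedure above can be sketched in a few lines of numpy (a minimal illustration; initialization and stopping are simplified, and the function names and synthetic data are mine):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen examples.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each example with its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its examples.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs of 2-D points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
labels, centroids = kmeans(X, k=2)
assert len(set(labels[:50])) == 1   # first blob ends up in one cluster
assert len(set(labels[50:])) == 1   # second blob in the other
assert labels[0] != labels[50]
```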
The initial positions of the centroids influence the final positions, so two runs of k-means can result in two different models. Some variants of k-means compute the initial positions of the centroids based on some properties of the dataset.
One run of the k-means algorithm is illustrated below:
The circles in the above figure are two-dimensional feature vectors; the squares are moving centroids. Different background colors represent regions in which all points belong to the same cluster.
The value of k, the number of clusters, is a hyperparameter that has to be tuned by the data analyst. There are some techniques for selecting k. None of them is proven optimal. Most of those techniques require the analyst to make an "educated guess" by looking at some metrics or by examining cluster assignments visually. In this chapter, I present one approach to choosing a reasonably good value for k without looking at the data and making guesses.
While k-means and similar algorithms are centroid-based, DBSCAN is a density-based clustering algorithm. Instead of guessing how many clusters you need, with DBSCAN you define two hyperparameters: ε and n. You start by picking an example x from your dataset at random and assign it to cluster 1. Then you count how many examples have a distance from x less than or equal to ε. If this quantity is greater than or equal to n, then you put all these ε-neighbors into the same cluster 1. You then examine each member of cluster 1 and find their respective ε-neighbors. If some member of cluster 1 has n or more ε-neighbors, you expand cluster 1 by adding those ε-neighbors to the cluster. You continue expanding cluster 1 until there are no more examples to put in it. In the latter case, you pick from the dataset another example not belonging to any cluster and put it into cluster 2. You continue like this until all examples either belong to some cluster or are marked as outliers. An outlier is an example whose ε-neighborhood contains fewer than n examples.
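The procedure can be sketched as follows (a naive O(N²) illustration assuming numpy; real implementations use spatial indexes, and the names and synthetic data are mine):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    # Label -1 marks outliers; labels 0, 1, ... mark clusters.
    n = len(X)
    labels = np.full(n, -1)
    dists = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:
            continue  # not a core point; stays an outlier unless claimed later
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:  # grow the cluster through density-connected core points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])
        cluster += 1
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(30, 2)),
               rng.normal(3.0, 0.1, size=(30, 2)),
               [[10.0, 10.0]]])               # one obvious outlier
labels = dbscan(X, eps=0.5, min_pts=4)
assert labels[-1] == -1                       # the far point is an outlier
assert len(set(labels[:30])) == 1             # each dense blob is one cluster
assert len(set(labels[30:60])) == 1
assert labels[0] != labels[30]
```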
The advantage of DBSCAN is that it can build clusters with an arbitrary shape, while k-means and other centroid-based algorithms create clusters that have the shape of a hypersphere. An obvious drawback of DBSCAN is that it has two hyperparameters, and choosing good values for them (especially ε) can be challenging. Furthermore, with ε fixed, the algorithm cannot effectively deal with clusters of varying density.
HDBSCAN is a clustering algorithm that keeps the advantages of DBSCAN while removing the need to decide on the value of ε. The algorithm is capable of building clusters of varying density. HDBSCAN is an ingenious combination of multiple ideas, and describing the algorithm in full is beyond the scope of this book.
HDBSCAN has only one important hyperparameter: n, the minimum number of examples to put in a cluster. This hyperparameter is relatively simple to choose by intuition. HDBSCAN has very fast implementations: it can deal with millions of examples effectively. Modern implementations of k-means are much faster than HDBSCAN, though, but the qualities of the latter may outweigh its drawbacks for many practical tasks. I recommend always trying HDBSCAN on your data first.
The most important question is: how many clusters does your dataset have? When the feature vectors are one-, two- or three-dimensional, you can look at the data and see "clouds" of points in the feature space. Each cloud is a potential cluster. However, for D-dimensional data, with D > 3, looking at the data is problematic¹.
One way of determining the reasonable number of clusters is based on the concept of prediction strength. The idea is to split the data into a training and a test set, similarly to how we do it in supervised learning. Once you have the training and test sets, S_tr of size N_tr and S_te of size N_te respectively, you fix k, the number of clusters, and run a clustering algorithm C on sets S_tr and S_te to obtain the clustering results C(S_tr, k) and C(S_te, k).
Let A be the clustering C(S_tr, k) built using the training set. The clusters in A can be seen as regions. If an example falls within one of those regions, then that example belongs to that specific cluster. For example, if we apply the k-means algorithm to some dataset, it results in a partition of the feature space into polygonal regions, as we saw in fig. 46.
Define the N_te × N_te co-membership matrix D[A, S_te] as follows: D[A, S_te]^{(i,i′)} = 1 if and only if examples x_i and x_{i′} from the test set belong to the same cluster according to the clustering A; otherwise, D[A, S_te]^{(i,i′)} = 0.
Let's take a break and see what we have here. We have built, using the training set of examples, a clustering A that has k clusters. Then we have built the co-membership matrix that indicates whether two examples from the test set belong to the same cluster according to A.
Intuitively, if k is a reasonable number of clusters, then two examples that belong to the same cluster in the clustering C(S_te, k) will most likely also belong to the same cluster in the clustering C(S_tr, k). On the other hand, if k is not reasonable (too high or too low), then the training data-based and test data-based clusterings will likely be less consistent.
Using the data shown in fig. 47, the idea is illustrated below:
The plots in fig. 48a and fig. 48b show, respectively, the training and test clusterings with their respective cluster regions.
Test examples plotted over the training data cluster regions are shown in fig. 48c. You can see in fig. 48c that the orange test examples no longer belong to the same cluster according to the cluster regions obtained from the training data. This will result in many zeroes in the matrix D[A, S_te], which, in turn, is an indicator that k is likely not the best number of clusters.
More formally, the prediction strength for the number of clusters k is given by,

ps(k) ≝ min_{j=1,…,k} [ 1/(|A_j|(|A_j| − 1)) ] Σ_{i≠i′ ∈ A_j} D[A, S_te]^{(i,i′)},

where A ≝ C(S_tr, k), A_j is the jth cluster from the clustering C(S_te, k), and |A_j| is the number of examples in cluster A_j.
Given a clustering C(S_tr, k), for each test cluster we compute the proportion of observation pairs in that cluster that are also assigned to the same cluster by the training set centroids. The prediction strength is the minimum of this quantity over the k test clusters.
Experiments suggest that a reasonable number of clusters is the largest k such that ps(k) is above 0.8. In fig. 49, you can see examples of the prediction strength for different values of k for two-, three- and four-cluster data.
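The computation can be sketched as follows (a simplified illustration assuming numpy and k-means as the clustering algorithm C; the helper names and synthetic data are mine):

```python
import numpy as np

def assign(X, centroids):
    # Map each example to the region (closest centroid) of a clustering.
    return np.linalg.norm(X[:, None] - centroids[None, :], axis=2).argmin(axis=1)

def kmeans(X, k, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        lab = assign(X, c)
        c = np.array([X[lab == j].mean(axis=0) if np.any(lab == j) else c[j]
                      for j in range(k)])
    return assign(X, c), c

def prediction_strength(X_tr, X_te, k):
    _, c_tr = kmeans(X_tr, k)        # training clustering (defines the regions)
    te_labels, _ = kmeans(X_te, k)   # test clustering
    regions = assign(X_te, c_tr)     # test examples mapped to training regions
    strengths = []
    for j in range(k):
        idx = np.flatnonzero(te_labels == j)
        if len(idx) < 2:
            continue
        # Fraction of pairs from test cluster j that also share a training region.
        same = sum(regions[a] == regions[b]
                   for ai, a in enumerate(idx) for b in idx[ai + 1:])
        strengths.append(same / (len(idx) * (len(idx) - 1) / 2))
    return min(strengths)

rng = np.random.default_rng(3)
def two_blobs(n):
    return np.vstack([rng.normal(0.0, 0.3, size=(n, 2)),
                      rng.normal(4.0, 0.3, size=(n, 2))])

X_tr, X_te = two_blobs(40), two_blobs(40)
ps2 = prediction_strength(X_tr, X_te, 2)
ps4 = prediction_strength(X_tr, X_te, 4)  # over-splitting usually scores lower
assert ps2 > 0.75  # k = 2 matches the true two-blob structure
```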
For non-deterministic clustering algorithms, such as k-means, which can generate different clusterings depending on the initial positions of the centroids, it is recommended to do multiple runs of the clustering algorithm for the same k and compute the average prediction strength over the runs.
Another effective method to estimate the number of clusters is the gap statistic method. Other, less automatic methods, which some analysts still use, include the elbow method and the average silhouette method.
DBSCAN and k-means compute so-called hard clustering, in which each example can belong to only one cluster. The Gaussian mixture model (GMM) allows each example to be a member of several clusters with different membership scores (HDBSCAN also allows this). Computing a GMM is very similar to doing model-based density estimation. In a GMM, instead of having just one multivariate normal distribution (MND), we have a weighted sum of several MNDs:

f_X = Σ_{j=1}^{k} φ_j f_{μ_j, Σ_j},

where f_{μ_j, Σ_j} is the MND j, and φ_j is its weight in the sum. The values of the parameters μ_j, Σ_j, and φ_j, for all j = 1, …, k, are obtained using the expectation maximization (EM) algorithm to optimize the maximum likelihood criterion.
Again, for simplicity, let us look at one-dimensional data. Also assume that there are two clusters: k = 2. In this case, we have two Gaussian distributions,

f(x | μ_1, σ_1²) = (1/√(2πσ_1²)) exp( −(x − μ_1)²/(2σ_1²) )

and

f(x | μ_2, σ_2²) = (1/√(2πσ_2²)) exp( −(x − μ_2)²/(2σ_2²) ),

where f(x | μ_1, σ_1²) and f(x | μ_2, σ_2²) are two pdfs defining the likelihood of x.
We use the EM algorithm to estimate μ_1, σ_1², μ_2, σ_2², φ_1, and φ_2. The parameters φ_1 and φ_2 are useful for the density estimation and less useful for clustering, as we will see below.
EM works as follows. In the beginning, we guess the initial values for μ_1, σ_1², μ_2, and σ_2², and set φ_1 = φ_2 = 1/2 (in general, it's 1/k for each φ_j, j = 1, …, k).
At each iteration of EM, the following four steps are executed:

Step 1. For all i = 1, …, N, calculate the likelihood of each x_i under each Gaussian: f(x_i | μ_1, σ_1²) and f(x_i | μ_2, σ_2²).

Step 2. Using Bayes' rule, for each example x_i, calculate the likelihood b_i^{(j)} that the example belongs to cluster j ∈ {1, 2} (in other words, the likelihood that the example was drawn from Gaussian j):

b_i^{(j)} ← f(x_i | μ_j, σ_j²) φ_j / ( f(x_i | μ_1, σ_1²) φ_1 + f(x_i | μ_2, σ_2²) φ_2 ).
The parameter φ_j reflects how likely it is that our Gaussian distribution j with parameters μ_j and σ_j² may have produced our dataset. That is why in the beginning we set φ_1 = φ_2 = 1/2: we don't know how likely each of the two Gaussians is, and we reflect our ignorance by setting the likelihood of both to one half.
Step 3. Compute the new values of μ_j and σ_j², for j ∈ {1, 2}, as,

μ_j ← Σ_{i=1}^N b_i^{(j)} x_i / Σ_{i=1}^N b_i^{(j)}   (28)

and

σ_j² ← Σ_{i=1}^N b_i^{(j)} (x_i − μ_j)² / Σ_{i=1}^N b_i^{(j)}.   (29)

Step 4. Update φ_j, for j ∈ {1, 2}, as,

φ_j ← (1/N) Σ_{i=1}^N b_i^{(j)}.
The steps are executed iteratively until the values of μ_j and σ_j² don't change much: for example, until the change is below some threshold ε. The process is illustrated in fig. 50 below.
You may have noticed that the EM algorithm is very similar to the k-means algorithm: start with random clusters, then iteratively update each cluster's parameters by averaging the data assigned to that cluster. The only difference in the case of GMM is that the assignment of an example x_i to a cluster j is soft: x_i belongs to cluster j with probability b_i^{(j)}. This is why we calculate the new values for μ_j and σ_j² in eq. 28 and eq. 29 not as an average (as in k-means) but as a weighted average with weights b_i^{(j)}.
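For the one-dimensional, two-cluster case described above, EM can be sketched as follows (a minimal illustration assuming numpy; the initialization scheme and the synthetic data are mine):

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    # f(x | mu, sigma^2) for a one-dimensional Gaussian
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def em_gmm_1d(x, n_iters=100):
    # Crude but serviceable initial guesses.
    mu1, mu2 = x.min(), x.max()
    s1 = s2 = x.var()
    phi1 = phi2 = 0.5
    for _ in range(n_iters):
        # Steps 1-2: likelihoods and membership scores via Bayes' rule.
        p1 = phi1 * normal_pdf(x, mu1, s1)
        p2 = phi2 * normal_pdf(x, mu2, s2)
        b1 = p1 / (p1 + p2)
        b2 = 1.0 - b1
        # Steps 3-4: weighted-average updates of the parameters.
        mu1, mu2 = (b1 * x).sum() / b1.sum(), (b2 * x).sum() / b2.sum()
        s1 = (b1 * (x - mu1) ** 2).sum() / b1.sum()
        s2 = (b2 * (x - mu2) ** 2).sum() / b2.sum()
        phi1, phi2 = b1.mean(), b2.mean()
    return (mu1, s1, phi1), (mu2, s2, phi2)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-4.0, 1.0, 300), rng.normal(4.0, 1.0, 300)])
(mu1, s1, phi1), (mu2, s2, phi2) = em_gmm_1d(x)

# The two recovered means land near the true component means of -4 and 4.
means = sorted([mu1, mu2])
assert abs(means[0] + 4.0) < 0.5 and abs(means[1] - 4.0) < 0.5
assert abs(phi1 - 0.5) < 0.05  # both components carry about half the mass
```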
Once we have learned the parameters μ_j and σ_j² for each cluster j, the membership score of example x in cluster j is given by f(x | μ_j, σ_j²).
The extension to D-dimensional data (D > 1) is straightforward. The only difference is that, instead of the variance σ², we now have the covariance matrix Σ that parametrizes the multivariate normal distribution (MND).
Contrary to k-means, where clusters can only be circular, the clusters in a GMM have the form of an ellipse that can have arbitrary elongation and rotation. The values in the covariance matrix control these properties.
There's no universally recognized method to choose the right k in a GMM. I recommend that you first split the dataset into a training and a test set. Then you try different values of k and build a different model f_tr^k for each k on the training data. You pick the value of k that maximizes the likelihood of the examples in the test set:

arg max_k Σ_{i=1}^{N_te} f_tr^k(x_i),

where N_te is the size of the test set.
There is a variety of clustering algorithms described in the literature. Worth mentioning are spectral clustering and hierarchical clustering. For some datasets, you may find those more appropriate. However, in most practical cases, k-means, HDBSCAN, and the Gaussian mixture model would satisfy your needs.
Modern machine learning algorithms, such as ensemble algorithms and neural networks, handle very high-dimensional examples well, up to millions of features. With modern computers and graphics processing units (GPUs), dimensionality reduction techniques are used much less in practice than in the past. The most frequent use case for dimensionality reduction is data visualization: humans can only interpret a maximum of three dimensions on a plot.
Another situation in which you could benefit from dimensionality reduction is when you have to build an interpretable model and are limited in your choice of learning algorithms. For example, you can only use decision tree learning or linear regression. By reducing your data to lower dimensionality, and by figuring out which quality of the original example each new feature in the reduced feature space reflects, you can use simpler algorithms. Dimensionality reduction removes redundant or highly correlated features; it also reduces the noise in the data. All of that contributes to the interpretability of the model.
Three widely used techniques of dimensionality reduction are principal component analysis (PCA), uniform manifold approximation and projection (UMAP), and autoencoders.
I already explained autoencoders in Chapter 7. You can use the low-dimensional output of the autoencoder's bottleneck layer as the vector of reduced dimensionality that represents the high-dimensional input feature vector. You know that this low-dimensional vector represents the essential information contained in the input vector, because the autoencoder is capable of reconstructing the input feature vector from the bottleneck layer output alone.
Principal component analysis, or PCA, is one of the oldest dimensionality reduction methods. The math behind it involves operations on matrices that I didn't explain in Chapter 2, so I leave the math of PCA for your further reading. Here, I only provide the intuition and illustrate the method on an example.
Consider a two-dimensional dataset as shown in fig. 51a.
Principal components are vectors that define a new coordinate system in which the first axis goes in the direction of the highest variance in the data. The second axis is orthogonal to the first one and goes in the direction of the second highest variance in the data. If our data were three-dimensional, the third axis would be orthogonal to both the first and the second axes and go in the direction of the third highest variance, and so on. In fig. 51b, the two principal components are shown as arrows. The length of an arrow reflects the variance in its direction.
Now, if we want to reduce the dimensionality of our data to D_new < D, we pick the D_new largest principal components and project our data points onto them. For our two-dimensional illustration, we can set D_new = 1 and project our examples onto the first principal component to obtain the orange points in fig. 51c.
To describe each orange point, we need only one coordinate instead of two: the coordinate with respect to the first principal component. When our data is very high-dimensional, it often happens in practice that the first two or three principal components account for most of the variation in the data, so by displaying the data on a 2D or 3D plot we can indeed see the properties of very high-dimensional data.
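The projection described above can be sketched via the eigendecomposition of the covariance matrix (a minimal illustration assuming numpy; the function names and synthetic data are mine):

```python
import numpy as np

def pca(X, k):
    # Center the data, then take the top-k eigenvectors of the covariance matrix.
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    components = eigvecs[:, ::-1][:, :k]     # top-k principal components
    return Xc @ components, components

rng = np.random.default_rng(0)
# Strongly correlated 2-D data: almost all variance lies along one direction.
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t + 0.1 * rng.normal(size=200)])
Z, components = pca(X, k=1)

assert Z.shape == (200, 1)
total_var = X.var(axis=0).sum()
# The single retained coordinate captures almost all of the variance.
assert Z.var() / total_var > 0.95
```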
The idea behind many of the modern dimensionality reduction algorithms, especially those designed specifically for visualization purposes, such as t-SNE and UMAP, is basically the same. We first design a similarity metric for two examples. For visualization purposes, besides the Euclidean distance between the two examples, this similarity metric often reflects some local properties of the two examples, such as the density of other examples around them.
In UMAP, this similarity metric w is defined as follows,

w(x_i, x_j) ≝ w_i(x_i, x_j) + w_j(x_j, x_i) − w_i(x_i, x_j) w_j(x_j, x_i).   (30)
The function w_i(x_i, x_j) is defined as,

w_i(x_i, x_j) ≝ exp( −( d(x_i, x_j) − ρ_i ) / σ_i ),
where d(x_i, x_j) is the Euclidean distance between two examples, ρ_i is the distance from x_i to its closest neighbor, and σ_i is the distance from x_i to its kth closest neighbor (k is a hyperparameter of the algorithm).
It can be shown that the metric in eq. 30 varies in the range from 0 to 1 and is symmetric, which means that w(x_i, x_j) = w(x_j, x_i).
Let w denote the similarity of two examples in the original high-dimensional space, and let w′ be the similarity given by the same eq. 30 in the new low-dimensional space.
To continue, I need to quickly introduce the notion of a fuzzy set. A fuzzy set is a generalization of a set. For each element x in a fuzzy set S, there's a membership function μ_S(x) ∈ [0, 1] that defines the membership strength of x in the set S. We say that x weakly belongs to the fuzzy set S if μ_S(x) is close to zero. On the other hand, if μ_S(x) is close to 1, then x has a strong membership in S. If μ_S(x) ∈ {0, 1} for all x, then the fuzzy set S becomes equivalent to a normal, nonfuzzy set.
Let's now see why we need this notion of a fuzzy set here.
Because the values of w and w′ lie in the range between 0 and 1, we can see w(x_i, x_j) as the membership of the pair of examples (x_i, x_j) in a certain fuzzy set. The same can be said about w′. The notion of similarity of two fuzzy sets is called fuzzy set cross-entropy and is defined as,

C_{w,w′} = Σ_{i=1}^{N} Σ_{j=1}^{N} [ w(x_i, x_j) ln( w(x_i, x_j) / w′(x′_i, x′_j) ) + (1 − w(x_i, x_j)) ln( (1 − w(x_i, x_j)) / (1 − w′(x′_i, x′_j)) ) ],   (31)
where x′ is the low-dimensional "version" of the original high-dimensional example x.
In eq. 31, the unknown parameters are the x′_i (for all i = 1, …, N), the low-dimensional examples we look for. We can compute them by gradient descent, minimizing C_{w,w′}.
In fig. 52–fig. 54, you can see the result of dimensionality reduction applied to the MNIST dataset of handwritten digits.
MNIST is commonly used for benchmarking various image processing systems; it contains 70,000 labeled examples. The ten different colors on the plot correspond to the ten classes. Each point on the plot corresponds to a specific example in the dataset. As you can see, UMAP separates the examples visually better (remember, it doesn't have access to the labels). In practice, UMAP is slightly slower than PCA but faster than an autoencoder.
Outlier detection is the problem of detecting the examples in the dataset that are very different from what a typical example in the dataset looks like. We have already seen several techniques that could help to solve this problem: autoencoders and one-class classifier learning. If we use an autoencoder, we train it on our dataset. Then, to predict whether an example is an outlier, we can use the autoencoder model to reconstruct the example from the bottleneck layer. The model is unlikely to be capable of reconstructing an outlier.
In one-class classification, the model either predicts that the input example belongs to the class, or it's an outlier.
Some analysts look at multiple two-dimensional plots, in which only a pair of features are present at the same time. It might give an intuition about the number of clusters. However, such an approach suffers from subjectivity, is prone to error and counts as an educated guess rather than a scientific method.
I mentioned that the most frequently used metrics of similarity (or dissimilarity) between two feature vectors are the Euclidean distance and cosine similarity. Such choices of metric seem logical but arbitrary, just like the choice of the squared error in linear regression (or the form of linear regression itself). The fact that one metric can work better than another depending on the dataset indicates that none of them is perfect.
You can create a metric that works better for your dataset. It's then possible to integrate your metric into any learning algorithm that needs one, like k-means or kNN. How can you know, without trying all possibilities, which equation would be a good metric? As you could have already guessed, a metric can be learned from data.
Remember the Euclidean distance between two feature vectors x and x′:

d(x, x′) = ‖x − x′‖ ≝ √( (x − x′)(x − x′) ).
We can slightly modify this metric to make it parametrizable and then learn the parameters from data. Consider the following modification:

d_A(x, x′) ≝ ‖x − x′‖_A = √( (x − x′)ᵀ A (x − x′) ),
where A is a D × D matrix. Let's say D = 3. If we let A be the identity matrix,

A = [[1, 0, 0], [0, 1, 0], [0, 0, 1]],

then d_A becomes the Euclidean distance. If we have a general diagonal matrix with different positive values on the diagonal, for example,

A = [[1, 0, 0], [0, 5, 0], [0, 0, 1]],
then different dimensions have different importance in the metric. (In the above example, the second dimension is the most important in the metric calculation.) More generally, to be called a metric, a function of two variables has to satisfy three conditions:

1. d(x, x′) ≥ 0 (nonnegativity),
2. d(x, x′) ≤ d(x, z) + d(z, x′) (triangle inequality),
3. d(x, x′) = d(x′, x) (symmetry).
To satisfy the first two conditions, the matrix A has to be positive semidefinite. You can see a positive semidefinite matrix as the generalization of the notion of a nonnegative real number to matrices. Any positive semidefinite matrix M satisfies:

zᵀ M z ≥ 0,

for any vector z having the same dimensionality as the number of rows and columns in M.
The above property follows from the definition of a positive semidefinite matrix. The proof that the second condition is satisfied when the matrix A is positive semidefinite can be found on the book's companion website.
To satisfy the third condition, we can simply take (d(x, x′) + d(x′, x))/2.
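A quick numeric sketch of the parameterized metric d_A (assuming numpy; the example vectors and matrices are mine):

```python
import numpy as np

def metric(x, x_prime, A):
    # d_A(x, x') = sqrt((x - x')^T A (x - x'))
    diff = x - x_prime
    return float(np.sqrt(diff @ A @ diff))

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# With A = I, the metric reduces to the Euclidean distance.
assert np.isclose(metric(x, y, np.eye(2)), 5.0)  # sqrt(3^2 + 4^2)

# A diagonal A reweights the dimensions: here the second axis counts more.
A = np.diag([1.0, 4.0])
assert np.isclose(metric(x, y, A), np.sqrt(9.0 + 4.0 * 16.0))
```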
Let's say we have an unannotated set X = {x_i}_{i=1}^N. To build the training data for our metric learning problem, we manually create two sets. The first set S is such that a pair of examples (x_i, x_k) belongs to S if x_i and x_k are similar (from our subjective perspective). The second set D is such that a pair of examples (x_i, x_k) belongs to D if x_i and x_k are dissimilar.
To train the matrix of parameters A from the data, we want to find a positive semidefinite matrix A that solves the following optimization problem:

min_A Σ_{(x_i, x_k) ∈ S} ‖x_i − x_k‖²_A,

such that:

Σ_{(x_i, x_k) ∈ D} ‖x_i − x_k‖_A ≥ c,

where c is a positive constant (it can be any number).
The solution to this optimization problem is found by gradient descent, with a modification that ensures that the found matrix is positive semidefinite. We leave the description of the algorithm out of the scope of this book for further reading.
I should point out that one-shot learning with siamese networks and triplet loss can be seen as a metric learning problem: pairs of pictures of the same person belong to the set S, while pairs of random pictures belong to D.
There are many other ways to learn a metric, including non-linear and kernel-based ones. However, the one presented in this book, as well as the one-shot learning adaptation, should suffice for most practical applications.
Learning to rank is a supervised learning problem. Among others, one frequent problem solved with learning to rank is the optimization of search results returned by a search engine for a query. In search result ranking optimization, a labeled example in the training set of size N is a ranked collection of documents of size r_i (the labels are the ranks of the documents). A feature vector represents each document in the collection. The goal of the learning is to find a ranking function f which outputs values that can be used to rank documents. For each training example, an ideal function f would output values that induce the same ranking of the documents as given by the labels.
Each example X_i, for i = 1, …, N, is a collection of feature vectors with labels: X_i = {(x_{i,j}, y_{i,j})}, j = 1, …, r_i. Features in a feature vector x_{i,j} represent the document j. For example, one feature could represent how recent the document is, another would reflect whether the words of the query can be found in the document title, a third could represent the size of the document, and so on. The label y_{i,j} could be the rank (1, 2, …, r_i) or a score. For example: the lower the score, the higher the document should be ranked.
There are three approaches to solve that problem: pointwise, pairwise, and listwise.
The pointwise approach transforms each training example into multiple examples: one example per document. The learning problem becomes a standard supervised learning problem, either regression or logistic regression. In each example (x, y) of the pointwise learning problem, x is the feature vector of some document, and y is the original score (if y_{i,j} is a score) or a synthetic score obtained from the ranking (the higher the rank, the lower the synthetic score). Any supervised learning algorithm can be used in this case. The solution is usually far from perfect. Principally, this is because each document is considered in isolation, while the original ranking (given by the labels of the original training set) could optimize the positions of the whole set of documents. For example, if we have already given a high rank to a Wikipedia page in some collection of documents, we would prefer not to give a high rank to another Wikipedia page for the same query.
In the pairwise approach, we also consider documents in isolation, but, in this case, a pair of documents is considered at once. Given a pair of documents (x_i, x_k), we build a model f which, given (x_i, x_k) as input, outputs a value close to 1 if x_i should be ranked higher than x_k; otherwise, it outputs a value close to 0. At test time, the final ranking for an unlabeled example is obtained by aggregating the predictions for all pairs of documents in the collection. The pairwise approach works better than the pointwise one, but it is still far from perfect.
The state-of-the-art rank learning algorithms, such as LambdaMART, implement the listwise approach. In the listwise approach, we try to optimize the model directly on some metric that reflects the quality of the ranking. There are various metrics for assessing search engine result ranking, including precision and recall. One popular metric that combines both precision and recall is called mean average precision (MAP).
To define MAP, let us ask judges (Google calls those people rankers) to examine a collection of search results for a query and assign a relevancy label to each search result. Labels could be binary (1 for "relevant" and 0 for "irrelevant") or on some scale, say from 1 to 5: the higher the value, the more relevant the document is to the search query. Let our judges build such a relevancy labeling for a collection of queries. Now, let us test our ranking model on this collection. The precision of our model for some query is given by:
precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|,

where {relevant documents} stands for the set of relevant documents, {retrieved documents} stands for the set of retrieved documents, and the notation |·| means "the number of." The average precision metric, AveP, is defined for a ranked collection of documents returned by a search engine for a query as,
AveP = (Σ_{k=1}^n P(k) · rel(k)) / |{relevant documents}|,

where n is the number of retrieved documents, P(k) denotes the precision computed for the top k search results returned by our ranking model for the query, and rel(k) is an indicator function equaling 1 if the item at rank k is a relevant document (according to the judges) and zero otherwise. Finally, the MAP for a collection of search queries of size Q is given by,

MAP = (1/Q) Σ_{q=1}^Q AveP(q).
Now we get back to LambdaMART. This algorithm implements the listwise approach, and it uses gradient boosting to train the ranking function h(x). Then, the binary model f(x_i, x_k) that predicts whether the document x_i should have a higher rank than the document x_k (for the same search query) is given by a sigmoid with a hyperparameter α,

f(x_i, x_k) ≜ 1 / (1 + exp(−α(h(x_i) − h(x_k)))).
Again, as with many models that predict a probability, the cost function is cross-entropy computed using the model f. In our gradient boosting, we combine multiple regression trees to build the function h(x) by trying to minimize the cost. Remember that in gradient boosting we add a tree to the model to reduce the error that the current model makes on the training data. For the classification problem, we computed the derivatives of the cost function to replace the real labels of training examples with those derivatives. LambdaMART works similarly, with one exception: it replaces the real gradient with a combination of the gradient and another factor that depends on the metric, such as MAP. This factor modifies the original gradient by increasing or decreasing it so that the metric value is improved.
That is a very bright idea, and not many supervised learning algorithms can boast that they optimize a metric directly. Optimizing a metric is what we really want, but what we do in a typical supervised learning algorithm is optimize the cost instead of the metric (because metrics are usually not differentiable). Usually, in supervised learning, as soon as we have found a model that optimizes the cost function, we try to tweak hyperparameters to improve the value of the metric. LambdaMART optimizes the metric directly.
The remaining question is how to build the ranked list of results based on the predictions of the model f, which predicts whether its first input should be ranked higher than its second input. This is generally a computationally hard problem, and there are multiple implementations of rankers capable of transforming pairwise comparisons into a ranked list.
The most straightforward approach is to use an existing sorting algorithm. Sorting algorithms sort a collection of numbers in increasing or decreasing order. (The simplest sorting algorithm is called bubble sort. It's usually taught in engineering schools.) Typically, sorting algorithms iteratively compare a pair of numbers in the collection and change their positions in the list based on the result of that comparison. If we plug our function f into a sorting algorithm to execute this comparison, the sorting algorithm will sort documents instead of numbers.
Learning to recommend is an approach to building recommender systems. Usually, we have a user who consumes content. We have the history of consumption and want to suggest new content to this user that they would like. It could be a movie on Netflix or a book on Amazon.
Traditionally, two approaches were used to give recommendations: content-based filtering and collaborative filtering.
Content-based filtering consists of learning what users like based on the description of the content they consume. For example, if the user of a news site often reads news articles on science and technology, then we would suggest more documents on science and technology to this user. More generally, we could create one training set per user, adding news articles to this dataset as feature vectors x and whether the user recently read each news article as labels y. Then we build a model for each user and can regularly examine each new piece of content to determine whether a specific user would read it or not.
The content-based approach has many limitations. For example, the user can be trapped in the so-called filter bubble: the system will always suggest to that user information that looks very similar to what the user has already consumed. That could result in the complete isolation of the user from information that disagrees with their viewpoints or would expand them. On a more practical side, the user might simply stop following the recommendations, which is undesirable.
Collaborative filtering has a significant advantage over content-based filtering: the recommendations to one user are computed based on what other users consume or rate. For instance, if two users gave high ratings to the same ten movies, then it's more likely that user 1 will appreciate new movies recommended based on the tastes of user 2, and vice versa. The drawback of this approach is that the content of the recommended items is ignored.
In collaborative filtering, the information on user preferences is organized in a matrix. Each row corresponds to a user, and each column corresponds to a piece of content that the user rated or consumed. Usually, this matrix is huge and extremely sparse, which means that most of its cells aren't filled (or are filled with zeros). The reason for such sparsity is that most users consume or rate just a tiny fraction of the available content items. It is very hard to make meaningful recommendations based on such sparse data.
Most real-world recommender systems use a hybrid approach: they combine recommendations obtained by the content-based and collaborative filtering models.
I already mentioned that a content-based recommender model could be built using a classification or regression model that predicts whether a user will like the content based on the content's features. Examples of features could include the words in books or news articles the user liked, the price, the recency of the content, the identity of the content author, and so on.
Two effective recommender system learning algorithms are factorization machines (FM) and denoising autoencoders (DAE).
The factorization machine is a relatively new kind of algorithm. It was explicitly designed for sparse datasets. Let's illustrate the problem.
In fig. 55 you see an example of sparse feature vectors with labels. Each feature vector represents information about one specific user and one specific movie. Features in the blue section represent a user. Users are encoded as one-hot vectors. Features in the green section represent a movie. Movies are also encoded as one-hot vectors. Features in the yellow section represent the normalized scores the user in blue gave to each movie they rated. Of the two remaining handcrafted features, the first represents the ratio of movies with an Oscar among those the user has watched, and the second represents the percentage of the movie in green watched by the user in blue before they scored it. The target y is the score given by the user in blue to the movie in green.
Real recommender systems can have millions of users, so the matrix in fig. 55 can count hundreds of millions of rows. The number of features could also be in the millions, depending on how rich the choice of content is and how creative you, as a data analyst, are in feature engineering. The two handcrafted features above were created during the feature engineering process, and I only show two of them for illustration purposes.
Trying to fit a regression or classification model to such an extremely sparse dataset would result in poor generalization. Factorization machines approach this problem differently.
The factorization machine model is defined as follows:

f(x) ≜ b + Σ_{i=1}^D w_i x_i + Σ_{i=1}^D Σ_{j=i+1}^D (v_i v_j) x_i x_j,

where b and w_i, i = 1, …, D, are scalar parameters similar to those used in linear regression. Vectors v_i are k-dimensional vectors of factors, where k is a hyperparameter that is usually much smaller than D. The expression v_i v_j is the dot product of the i-th and j-th vectors of factors. As you can see, instead of looking for one wide vector of parameters, which, because of sparsity, would reflect interactions between features poorly, we complete it with additional parameters that apply to the pairwise interactions x_i x_j between features. However, instead of having one parameter w_{i,j} for each interaction, which would add an enormous quantity of new parameters to the model (D(D − 1)/2, to be exact), we factorize w_{i,j} into v_i v_j, adding only Dk ≪ D(D − 1)/2 parameters to the model.
Depending on the problem, the loss function could be squared error loss (for regression) or hinge loss. For classification with y ∈ {−1, +1}, with hinge loss or logistic loss the prediction is made as y = sign(f(x)). The logistic loss is defined as,

loss(f(x), y) ≜ (1/ln 2) ln(1 + e^{−y f(x)}).
Gradient descent can be used to optimize the average loss. In the example in fig. 55, the labels are in {1, 2, 3, 4, 5}, so it's a multiclass problem. We can use the one-versus-rest strategy to convert this multiclass problem into five binary classification problems.
From Chapter 7, you know what a denoising autoencoder is: it's a neural network that reconstructs its input from the bottleneck layer. The fact that the input is corrupted by noise while the output shouldn't be makes denoising autoencoders an ideal tool for building a recommender model.
The idea is very straightforward: new movies a user could like are seen as if they were removed from the complete set of preferred movies by some corruption process. The goal of the denoising autoencoder is to reconstruct those removed items.
To prepare the training set for our denoising autoencoder, remove the blue and green features from the training set in fig. 55. Because some examples then become duplicates, keep only the unique ones.
At training time, randomly replace some of the non-zero yellow features in the input feature vectors with zeros. Train the autoencoder to reconstruct the uncorrupted input.
At prediction time, build a feature vector for the user. The feature vector will include the uncorrupted yellow features as well as the handcrafted features. Use the trained DAE model to reconstruct the uncorrupted input. Recommend to the user the movies that have the highest scores at the model's output.
Another effective collaborative-filtering model is an FFNN with two inputs and one output. Remember from Chapter 8 that neural networks are good at handling multiple simultaneous inputs. A training example here is a triplet (u, m, r). The input vector u is a one-hot encoding of a user. The second input vector m is a one-hot encoding of a movie. The output layer could be either a sigmoid (in which case the label r is in [0, 1]) or a ReLU, in which case r can be in some typical range, [1, 5] for example.
We have already discussed word embeddings in Chapter 7. Recall that word embeddings are feature vectors that represent words. They have the property that similar words have similar feature vectors. The question that you probably wanted to ask is where these word embeddings come from. The answer is (again): they are learned from data.
There are many algorithms that learn word embeddings. Here, we consider only one of them: word2vec, and only one version of word2vec called skip-gram, which works well in practice. Pretrained word2vec embeddings for many languages are available for download online.
In word embedding learning, our goal is to build a model which we can use to convert a one-hot encoding of a word into a word embedding. Let our dictionary contain 10,000 words. The one-hot vector for each word is a 10,000-dimensional vector of all zeros except for one dimension that contains a 1. Different words have the 1 in different dimensions.
Consider a sentence: "I almost finished reading the book on machine learning." Now, consider the same sentence from which we have removed one word, say "book." Our sentence becomes: "I almost finished reading the · on machine learning." Now let's only keep the three words before the · and the three words after it: "finished reading the · on machine learning." Looking at this seven-word window around the ·, if I asked you to guess what · stands for, you would probably say: "book," "article," or "paper." That's how the context words let you predict the word they surround. It's also how the machine can learn that the words "book," "paper," and "article" have a similar meaning: because they share similar contexts in multiple texts.
It turns out that it works the other way around too: a word can predict the context that surrounds it. The piece "finished reading the · on machine learning" is called a skip-gram with window size 7 (3 + 1 + 3). By using the documents available on the Web, we can easily create hundreds of millions of skip-grams.
Let's denote a skip-gram like this: [x_{−3}, x_{−2}, x_{−1}, x, x_{+1}, x_{+2}, x_{+3}]. In our sentence, x_{−3} is the one-hot vector for "finished," x_{−2} corresponds to "reading," x is the skipped word (·), x_{+1} is "on," and so on. A skip-gram with window size 5 will look like this: [x_{−2}, x_{−1}, x, x_{+1}, x_{+2}].
The skip-gram model is schematically depicted below:
It is a fully connected network, like the multilayer perceptron. The input word is the one denoted as x in the skip-gram. The neural network has to learn to predict the context words of the skip-gram given the central word.
You can see now why the learning of this kind is called self-supervised: the labeled examples get extracted from the unlabeled data such as text.
The activation function used in the output layer is softmax. The cost function is the negative log-likelihood. The embedding for a word is obtained as the output of the embedding layer when the one-hot encoding of this word is given as the input to the model.
Because of the large number of parameters in word2vec models, two techniques are used to make the computation more efficient: hierarchical softmax (an efficient way of computing softmax that consists in representing the outputs of the softmax as the leaves of a binary tree) and negative sampling (where the idea is to only update a random sample of all outputs per iteration of gradient descent). I leave these for further reading.
Wow, that was fast! You are really good if you got here and managed to understand most of the book's material.
If you look at the number at the bottom of this page, you see that I have overspent paper, which means that the book's title was slightly misleading. I hope you forgive me for this little marketing trick. After all, if I wanted to make this book exactly a hundred pages, I could have reduced the font size, the margins, and the line spacing, or removed the section on UMAP and left you on your own with the original paper. Believe me: you would not want to be left on your own with the original paper on UMAP! (Just kidding.)
However, by stopping now, I feel confident that you have everything you need to become a great modern data analyst or machine learning engineer. That doesn't mean that I covered everything, but what I covered in a hundred-plus pages you would otherwise find in a pile of books, each a thousand pages thick. Much of what I covered is not in those books at all: typical machine learning books are conservative and academic, while I emphasized the algorithms and methods that you will find useful in your day-to-day work.
What exactly would I have covered if this were a thousand-page machine learning book?
In text analysis, topic modeling is a prevalent unsupervised learning problem. You have a collection of text documents, and you would like to discover the topics present in each document. Latent Dirichlet Allocation (LDA) is a very effective algorithm for topic discovery. You decide how many topics are present in your collection of documents, and the algorithm assigns a topic to each word in this collection. Then, to extract the topics from a document, you simply count how many words of each topic are present in that document.
Gaussian processes (GP) are a supervised learning method that competes with kernel regression. GP have some advantages over the latter. For example, they provide confidence intervals for the regression line at each point. I decided not to explain GP because I could not figure out a simple way to explain them, but you definitely could spend some time learning about GP. It will be time well spent.
The Generalized Linear Model (GLM) is a generalization of linear regression to modeling various forms of dependency between the input feature vector and the target. Logistic regression, for instance, is one form of GLM. If you are interested in regression and are looking for simple and explainable models, you should definitely read more on GLMs.
I mentioned one example of probabilistic graphical models (PGMs) in Chapter 7: the conditional random field (CRF). With a CRF, you can model an input sequence of words and the relationships between the features and labels in this sequence as a sequential dependency graph. More generally, a PGM can be any graph. A graph is a structure consisting of a collection of nodes and edges, where each edge joins a pair of nodes. Each node in a PGM represents some random variable (whose values can be observed or unobserved), and the edges represent the conditional dependence of one random variable on another. For example, the random variable "sidewalk wetness" depends on the random variable "weather condition." By observing the values of some random variables, an optimization algorithm can learn from data the dependencies between the observed and unobserved variables.
PGMs allow the data analyst to see how the values of one feature depend on the values of other features. If the edges of the dependency graph are directed, it becomes possible to infer causality. Unfortunately, constructing such models by hand requires a substantial amount of domain expertise and a strong understanding of probability theory and statistics, and the latter is often a problem for many domain experts. Some algorithms can learn the structure of the dependency graph from data, but the learned models are often hard for a human to interpret, so they aren't particularly helpful for understanding the complex probabilistic processes that generated the data. The CRF is by far the most used PGM, with applications mostly in text and image processing. However, in these two domains, CRFs were surpassed by neural networks. Another graphical model, the Hidden Markov Model, or HMM, was in the past frequently used in speech recognition, time series analysis, and other temporal inference tasks; but, again, the HMM lost to neural networks.
If you still decide to learn more about PGMs, they are also known as Bayesian networks, belief networks, and probabilistic independence networks.
If you work with graphical models and want to sample examples from a very complex distribution defined by the dependency graph, you could use Markov Chain Monte Carlo (MCMC) algorithms. MCMC is a class of algorithms for sampling from any probability distribution defined mathematically. Remember that, when we talked about denoising autoencoders, we sampled noise from the normal distribution. Sampling from standard distributions, such as the normal or uniform, is relatively easy because their properties are well known. However, the task of sampling becomes significantly more complicated when the probability distribution can have an arbitrary form defined by a complex formula.
Generative adversarial networks, or GANs, are a class of neural networks used in unsupervised learning. They are implemented as a system of two neural networks contesting with each other in a zero-sum game setting. The most popular application of GANs is to learn to generate photographs that look authentic to human observers. The first of the two networks takes a random input (typically Gaussian noise) and learns to generate an image as a matrix of pixels. The second network takes as input two images: one "real" image from some collection of images as well as the image generated by the first network. The second network has to learn to recognize which one of the two images was generated by the first network. The first network gets a negative loss if the second network recognizes the "fake" image. The second network, on the other hand, gets penalized if it fails to recognize which one of the two images is fake.
Genetic algorithms (GA) are a numerical optimization technique used to optimize non-differentiable objective functions. They use concepts from evolutionary biology to search for a global optimum (minimum or maximum) of an optimization problem by mimicking evolutionary biological processes.
GA 的工作方式是从第一代候选解决方案开始。如果我们寻找模型参数的最佳值,我们首先随机生成参数值的多种组合。然后,我们根据目标函数测试参数值的每个组合。将参数值的每个组合想象为多维空间中的一个点。然后,我们通过应用“选择”、“交叉”和“变异”等概念,从上一代点生成下一代点。
GAs work by starting with an initial generation of candidate solutions. If we look for optimal values of the parameters of our model, we first randomly generate multiple combinations of parameter values. We then test each combination of parameter values against the objective function. Imagine each combination of parameter values as a point in a multi-dimensional space. We then generate a subsequent generation of points from the previous generation by applying such concepts as "selection," "crossover," and "mutation."
In a nutshell, that results in each new generation keeping more points similar to those points from the previous generation that performed the best against the objective. In the new generation, the points that performed the worst in the previous generation are replaced by “mutations” and “crossovers” of the points that performed the best. A mutation of a point is obtained by a random distortion of some attributes of the original point. A crossover is a certain combination of several points (for example, an average).
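These steps can be sketched in a few dozen lines. The objective below is a toy non-differentiable function (its absolute values create kinks that defeat gradient methods, with a minimum at (3, -1)), and the particular choices of selection (keep the top scorers), crossover (average two parents), and mutation (Gaussian noise) are simple illustrative assumptions, not the only options:

```python
import numpy as np

rng = np.random.default_rng(0)

# A non-differentiable toy objective; its minimum is at (3, -1).
def objective(point):
    return abs(point[0] - 3.0) + abs(point[1] + 1.0)

pop_size, n_keep, n_generations = 50, 10, 100

# Initial generation: random combinations of parameter values.
population = rng.uniform(-10.0, 10.0, size=(pop_size, 2))

for _ in range(n_generations):
    # Selection: keep the points that score best on the objective.
    fitness = np.array([objective(p) for p in population])
    best = population[np.argsort(fitness)[:n_keep]]
    children = []
    while len(children) < pop_size - n_keep:
        # Crossover: combine two good points (here, their average).
        i, j = rng.integers(0, n_keep, size=2)
        child = (best[i] + best[j]) / 2.0
        # Mutation: randomly distort some attributes of the point.
        child = child + rng.normal(0.0, 0.3, size=2)
        children.append(child)
    population = np.vstack([best, children])

best_point = min(population, key=objective)
print(best_point, objective(best_point))
```

Because the objective is only ever evaluated, never differentiated, the same loop applies to any measurable criterion, such as the cross-validated score of a model as a function of its hyperparameters.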
Genetic algorithms can find solutions to any measurable optimization criterion. For example, GAs can be used to optimize the hyperparameters of a learning algorithm. However, they are typically much slower than gradient-based optimization techniques.
As we already discussed, reinforcement learning (RL) solves a very specific kind of problem where the decision making is sequential. Usually, there’s an agent acting in an unknown environment. Each action brings a reward and moves the agent to another state of the environment (usually, as a result of some random process with unknown properties). The goal of the agent is to optimize its long-term reward.
Reinforcement learning algorithms, such as Q-learning, and their neural network-based counterparts are used to learn to play video games, in robotic navigation and coordination, inventory and supply chain management, optimization of complex electric power systems (power grids), and learning financial trading strategies.
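As a sketch of tabular Q-learning, consider a hypothetical five-state chain environment (all sizes, rewards, and hyperparameters below are toy assumptions): the agent stands on a line of states 0 through 4, can step left or right, and receives a reward of 1 for reaching state 4, which ends the episode. Each transition updates the table Q(s, a) toward the observed reward plus the discounted value of the best next action:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy environment: states 0..4 on a line; actions 0 = left, 1 = right;
# reaching state 4 pays reward 1 and ends the episode.
N_STATES = 5
alpha, gamma, epsilon = 0.1, 0.9, 0.3  # learning rate, discount, exploration

Q = np.zeros((N_STATES, 2))  # one row per state, one column per action

for _ in range(500):  # episodes
    s = 0
    while s != 4:
        # Epsilon-greedy: mostly exploit current Q, sometimes explore.
        a = rng.integers(2) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else min(4, s + 1)
        r = 1.0 if s_next == 4 else 0.0
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(Q)  # after training, the greedy action in every state is "right"
```

The long-term (discounted) reward appears in the learned values: states closer to the goal hold larger Q-values, so the greedy policy walks right even though only the final step is rewarded.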
The book stops here. Don't forget to occasionally visit the book's companion wiki to stay updated on new developments in each machine learning area considered in the book. As I said in the Preface, thanks to its constantly updated wiki, this book, like a good wine, keeps getting better after you buy it.
Oh, and don’t forget that the book is distributed on the read first, buy later principle. That means that if while reading these words you look at text on a digital screen and cannot remember having paid to get it, you are probably the right person for buying the book.
The high quality of this book would be impossible without volunteer editors. I especially thank the following readers for their systematic contributions: Martijn van Attekum, Daniel Maraini, Ali Aziz, Rachel Mak, Kelvin Sundli, and John Robinson.
Other wonderful people to whom I am grateful for their help are Michael Anuzis, Knut Sverdrup, Freddy Drennan, Carl W. Handlin, Abhijit Kumar, Lasse Vetter, Ricardo Reis, Daniel Gross, Johann Faouzi, Akash Agrawal, Nathanael Weill, Filip Jekic, Abhishek Babuji, Luan Vieira, Sayak Paul, Vaheid Wallets, Lorenzo Buffoni, Eli Friedman, Łukasz Mądry, Haolan Qin, Bibek Behera, Jennifer Cooper, Nishant Tyagi, Denis Akhiyarov, Aron Janarv, Alexander Ovcharenko, Ricardo Rios, Michael Mullen, Matthew Edwards, David Etlin, Manoj Balaji J, David Roy, Luan Vieira, Luiz Felix, Anand Mohan, Hadi Sotudeh, Charlie Newey, Zamir Akimbekov, Jesus Renero, Karan Gadiya, Mustafa Anıl Derbent, JQ Veenstra, Zsolt Kreisz, Ian Kelly, Lukasz Zawada, Magda Kowalska, Sylvain Pronovost, Robert Wareham, Thomas Bosman, Lv Steven, Ariel Rossanigo and Luciano Segura.